Results 1-10 of 16
Applied Numerical Linear Algebra
Society for Industrial and Applied Mathematics, 1997
"... We survey general techniques and open problems in numerical linear algebra on parallel architectures. We rst discuss basic principles of parallel processing, describing the costs of basic operations on parallel machines, including general principles for constructing e cient algorithms. We illustrate ..."
Abstract

Cited by 532 (26 self)
We survey general techniques and open problems in numerical linear algebra on parallel architectures. We first discuss basic principles of parallel processing, describing the costs of basic operations on parallel machines, including general principles for constructing efficient algorithms. We illustrate these principles using current architectures and software systems, and by showing how one would implement matrix multiplication. Then, we present direct and iterative algorithms for solving linear systems of equations, linear least squares problems, the symmetric eigenvalue problem, the nonsymmetric eigenvalue problem, and the singular value decomposition. We consider dense, band, and sparse matrices.
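The survey's running example of blocked algorithm design can be sketched concretely. The cache-blocked matrix multiply below is an illustrative reconstruction, not code from the survey; the block size `bs` and the pure-Python list-of-lists representation are our assumptions for clarity:

```python
def matmul_blocked(A, B, bs=2):
    """Cache-blocked dense matrix multiply C = A @ B.

    Each (ii, kk, jj) triple multiplies one pair of bs-by-bs tiles; tiles
    are small enough to stay in fast memory, and distinct output tiles
    could be computed by different processors in parallel.
    """
    n, m, p = len(A), len(B), len(B[0])
    C = [[0.0] * p for _ in range(n)]
    for ii in range(0, n, bs):
        for kk in range(0, m, bs):
            for jj in range(0, p, bs):
                for i in range(ii, min(ii + bs, n)):
                    for k in range(kk, min(kk + bs, m)):
                        aik = A[i][k]
                        for j in range(jj, min(jj + bs, p)):
                            C[i][j] += aik * B[k][j]
    return C
```

The triple tile loop changes only the order of the updates, not their values, so the result matches the naive algorithm while improving locality.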
Parallelizing While Loops for Multiprocessor Systems
In Proceedings of the 9th International Parallel Processing Symposium, 1995
"... Current parallelizing compilers treat while loops and do loops with conditional exits as sequential constructs because their iteration space is unknown. Motivated by the fact that these types of loops arise frequently in practice, we have developed techniques that can be used to automatically transf ..."
Abstract

Cited by 31 (13 self)
Current parallelizing compilers treat while loops and do loops with conditional exits as sequential constructs because their iteration space is unknown. Motivated by the fact that these types of loops arise frequently in practice, we have developed techniques that can be used to automatically transform them for parallel execution. We succeed in parallelizing loops involving linked-list traversals, something that has not been done before. This is an important problem since linked-list traversals arise frequently in loops with irregular access patterns, such as sparse matrix computations. The methods can even be applied to loops whose data dependence relations cannot be analyzed at compile time. We outline a cost/performance analysis that can be used to decide when the methods should be applied. Since, as we show, the expected speedups are significant, our conclusion is that they should almost always be applied, provided there is sufficient parallelism available in the original loop. We present experimental results on loops from the PERFECT Benchmarks and sparse matrix packages which substantiate our conclusion that these techniques can yield significant speedups.
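The linked-list result can be read in inspector/executor terms: a cheap sequential pass discovers the traversal order, after which the per-node work, assumed independent, runs in parallel. A minimal Python sketch under those assumptions follows; `Node`, `inspect`, and the thread pool are all illustrative names, not the paper's implementation:

```python
from concurrent.futures import ThreadPoolExecutor

class Node:
    """Singly linked list node."""
    def __init__(self, value, nxt=None):
        self.value = value
        self.next = nxt

def inspect(head):
    """Inspector: sequential pass that materializes the traversal order."""
    nodes = []
    n = head
    while n is not None:
        nodes.append(n)
        n = n.next
    return nodes

def execute_parallel(nodes, work):
    """Executor: apply independent per-node work concurrently.

    map() returns results in input order, so the output matches a
    sequential traversal.
    """
    with ThreadPoolExecutor() as pool:
        return list(pool.map(work, nodes))

# Build the list 1 -> 2 -> 3 and square each value.
head = Node(1, Node(2, Node(3)))
results = execute_parallel(inspect(head), lambda n: n.value ** 2)
```

The split pays off only when the per-node work dominates the sequential pointer chase, which mirrors the cost/performance analysis the abstract mentions.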
Runtime Parallelization: A Framework for Parallel Computation
1995
"... The goal of parallelizing, or restructuring, compilers is to detect and exploit parallelism in sequential programs written in conventional languages. Current parallelizing compilers do a reasonable job of extracting parallelism from programs with regular, statically analyzable access patterns. Howev ..."
Abstract

Cited by 16 (8 self)
The goal of parallelizing, or restructuring, compilers is to detect and exploit parallelism in sequential programs written in conventional languages. Current parallelizing compilers do a reasonable job of extracting parallelism from programs with regular, statically analyzable access patterns. However, if the memory access pattern of the program is input-data dependent, then static data dependence analysis, and consequently parallelization, is impossible. Moreover, in this case the compiler cannot apply privatization and reduction parallelization, the transformations that have proven to be the most effective in removing data dependences and increasing the amount of exploitable parallelism in the program. Typical examples of irregular, dynamic applications are complex simulations such as SPICE for circuit simulation, DYNA3D for structural mechanics modeling, DMOL for quantum mechanical simulation of molecules, and CHARMM for molecular dynamics simulation of organic systems. Therefore, since irregular programs represent a large and important fraction of applications, an automatable framework for runtime parallelization is needed to complement existing and future static compiler techniques. In this thesis,
Theory, Techniques, And Experiments In Solving Recurrences In Computer Programs
1997
"... ... work. In the sixth chapter, we consider the application of these same techniques focused on obtaining parallelism in outer timestepping loops. In the final chapter, we draw this work to a conclusion and discuss future directions in parallelizing compiler technology. ..."
Abstract

Cited by 16 (2 self)
... work. In the sixth chapter, we consider the application of these same techniques focused on obtaining parallelism in outer time-stepping loops. In the final chapter, we draw this work to a conclusion and discuss future directions in parallelizing compiler technology.
Acceleration of First and Higher Order Recurrences on Processors with Instruction Level Parallelism
In Sixth International Workshop on Languages and Compilers for Parallel Computing, 1993
"... This report describes parallelization techniques for accelerating a broad class of recurrences on processors with instruction level parallelism. We introduce a new technique, called blocked backsubstitution, which has lower operation count and higher performance than previous methods. The blocked b ..."
Abstract

Cited by 11 (2 self)
This report describes parallelization techniques for accelerating a broad class of recurrences on processors with instruction-level parallelism. We introduce a new technique, called blocked back-substitution, which has a lower operation count and higher performance than previous methods. The blocked back-substitution technique requires unrolling and nonsymmetric optimization of innermost loop iterations. We present metrics to characterize the performance of software-pipelined loops and compare these metrics for a range of height reduction techniques and processor architectures.
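The height-reduction idea behind such methods can be sketched for the first-order recurrence x[i] = a[i]*x[i-1] + b[i]: each block of iterations collapses to a single affine map, shortening the critical dependence chain. The following scalar Python sketch illustrates that general idea under our own naming; it is not the paper's blocked back-substitution algorithm itself:

```python
def recurrence_seq(a, b, x0):
    """Reference sequential evaluation of x[i] = a[i]*x[i-1] + b[i]."""
    x, xs = x0, []
    for ai, bi in zip(a, b):
        x = ai * x + bi
        xs.append(x)
    return xs

def recurrence_blocked(a, b, x0, block=4):
    """Blocked evaluation of the same recurrence.

    Each block is first reduced to one affine map x_out = A*x_in + C
    (the height-reduction step, independent of the incoming value);
    the maps are chained across blocks, then interior values are filled
    in by local substitution. On an ILP machine the reduction and the
    fill-in expose independent operations within each block.
    """
    n = len(a)
    xs = [0.0] * n
    x = x0
    for s in range(0, n, block):
        e = min(s + block, n)
        A, C = 1.0, 0.0
        for i in range(s, e):          # reduce block to one affine map
            A, C = a[i] * A, a[i] * C + b[i]
        x_out = A * x + C              # chain across block boundary
        xi = x
        for i in range(s, e):          # fill in interior values
            xi = a[i] * xi + b[i]
            xs[i] = xi
        x = x_out
    return xs
```

Both routines compute the same values; the blocked form merely reorders the arithmetic, which is the trade-off (more operations, shorter critical path) the abstract's operation-count comparison is about.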
Stability of parallel triangular system solvers
SIAM J. Sci. Comput., 1995
"... And by contacting: ..."
Solving Linear Recurrences with Loop Raking
1992
"... We present a variation of the partition method for solving linear recurrences that is wellsuited to vector multiprocessors. The algorithm fully utilizes both vector and multiprocessor capabilities, and reduces the number of memory accesses and temporary memory requirements as compared to the more co ..."
Abstract

Cited by 7 (4 self)
We present a variation of the partition method for solving linear recurrences that is well-suited to vector multiprocessors. The algorithm fully utilizes both vector and multiprocessor capabilities, and reduces the number of memory accesses and temporary memory requirements as compared to the more commonly used version of the partition method. Our variation uses a general loop restructuring technique called loop raking. We describe an implementation of this technique on the CRAY Y-MP C90, and present performance results for first- and second-order linear recurrences. On a single processor of the C90 our implementations are up to 7.3 times faster than the corresponding optimized library routines in SCILIB, an optimized mathematical library supplied by Cray Research. On 4 processors, we gain an additional speedup of at least 3.7.
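The baseline partition method the paper varies can be illustrated for a first-order recurrence: split the index range into per-lane segments, reduce every segment to an affine map using vector operations, chain the maps with a short sequential fix-up, then rebuild interior values vectorially. A NumPy sketch of that baseline scheme follows; the lane count and all names are our assumptions, and the paper's raked memory layout and CRAY-specific tuning are not reproduced:

```python
import numpy as np

def recurrence_partitioned(a, b, x0, lanes=4):
    """Partition-method evaluation of x[i] = a[i]*x[i-1] + b[i].

    Each of `lanes` contiguous segments is reduced in lock-step (one
    vector operation per step across all lanes) to an affine map
    x_seg_end = P*x_in + Q, the maps are chained sequentially across
    segment boundaries, and a final vector phase rebuilds all values.
    Requires len(a) to be a multiple of `lanes`.
    """
    n = len(a)
    assert n % lanes == 0
    seg = n // lanes
    A = np.asarray(a, dtype=float).reshape(lanes, seg)
    B = np.asarray(b, dtype=float).reshape(lanes, seg)
    # Vector phase: per-lane affine maps, independent of the unknown x_in.
    P, Q = np.ones(lanes), np.zeros(lanes)
    for j in range(seg):
        P, Q = A[:, j] * P, A[:, j] * Q + B[:, j]
    # Fix-up phase: short sequential chain over segment boundaries.
    x_in = np.empty(lanes)
    x = x0
    for k in range(lanes):
        x_in[k] = x
        x = P[k] * x + Q[k]
    # Final vector phase: rebuild interior values in all lanes at once.
    X = np.empty((lanes, seg))
    xcur = x_in.copy()
    for j in range(seg):
        xcur = A[:, j] * xcur + B[:, j]
        X[:, j] = xcur
    return X.reshape(-1)
```

The paper's contribution lies in how the raked layout arranges these accesses in memory; the algebra above is the common core.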
Parallel Ocean Circulation Modeling on Cedar
1991
"... The simulation of ocean circulation is a problem of great importance in the environmental sciences of today. Realistic simulations impose great demands on the existing computer systems, requiring the use of considerable computational power and storage capabilities. An implementation of an ocean gene ..."
Abstract

Cited by 5 (3 self)
The simulation of ocean circulation is a problem of great importance in the environmental sciences of today. Realistic simulations impose great demands on existing computer systems, requiring considerable computational power and storage capabilities. An implementation of an ocean general circulation model on the Cedar multicluster architecture is presented. It is based on the GFDL three-dimensional model, which was adapted for simulation of the Mediterranean. The model simulates the basic aspects of large-scale, baroclinic ocean circulation, including treatment of irregular bottom topography. The data and computational mapping strategies and their effect on performance are discussed. The Cedar version of the code, using four clusters and 32 processors, has demonstrated significant speedup compared to a single cluster and compared to a single processor.
Developments and Trends in the Parallel Solution of Linear Systems
Parallel Computing, 1999
"... In this review paper, we consider some important developments and trends in algorithm design for the solution of linear systems concentrating on aspects that involve the exploitation of parallelism. We briefly discuss the solution of dense linear systems, before studying the solution of sparse equat ..."
Abstract

Cited by 5 (0 self)
In this review paper, we consider some important developments and trends in algorithm design for the solution of linear systems, concentrating on aspects that involve the exploitation of parallelism. We briefly discuss the solution of dense linear systems, before studying the solution of sparse equations by direct and iterative methods. We consider preconditioning techniques for iterative solvers and discuss some of the present research issues in this field. Keywords: linear systems, dense matrices, sparse matrices, tridiagonal systems, parallelism, direct methods, iterative methods, Krylov methods, preconditioning. AMS(MOS) subject classifications: 65F05, 65F50. Solution methods for systems of linear equations Ax = b, where A is a coefficient matrix of order n and x and b are n-vectors, are usually grouped into two distinct classes: direct methods and iterative methods.
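As a small concrete instance of the iterative class the review surveys, a plain Jacobi sweep for Ax = b can be sketched as follows. This is a generic textbook method chosen for illustration, not an algorithm from the paper, and it assumes A is strictly diagonally dominant so that the iteration converges:

```python
def jacobi(A, b, iters=100):
    """Jacobi iteration for Ax = b.

    Each sweep computes every component of x_new from the previous x
    only, so all n updates within a sweep are independent and could run
    in parallel, which is why Jacobi is a standard parallel baseline.
    Assumes nonzero diagonal and diagonal dominance for convergence.
    """
    n = len(b)
    x = [0.0] * n
    for _ in range(iters):
        x = [
            (b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
            for i in range(n)
        ]
    return x
```

Direct methods (factor, then solve) trade this easy parallelism for a fixed operation count, which is the tension the review explores.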
Computing Programs Containing Band Linear Recurrences on Vector Supercomputers
IEEE Trans. on Parallel and Distributed Systems, 1992
"... Many largescale scientific and engineering computations, e.g., some of the Grand Challenge problems [1], spend a major portion of execution time in their core loops computing band linear recurrences (BLR's). Conventional compiler parallelization techniques [4] cannot generate scalable parallel code ..."
Abstract

Cited by 4 (2 self)
Many large-scale scientific and engineering computations, e.g., some of the Grand Challenge problems [1], spend a major portion of execution time in their core loops computing band linear recurrences (BLRs). Conventional compiler parallelization techniques [4] cannot generate scalable parallel code for this type of computation because they respect loop-carried dependences (LCDs) in programs, and there is a limited amount of parallelism in a BLR with respect to LCDs. For many applications, using library routines to replace the core BLR requires separating the BLR from its dependent computation, which usually incurs significant overhead. In this paper, we present a new scalable algorithm, called the Regular Schedule, for parallel evaluation of BLRs. We describe our implementation of the Regular Schedule and discuss how to obtain maximum memory throughput in implementing the schedule on vector supercomputers. We also illustrate our approach, based on our Regular Schedule, to parallel...