Parallel Numerical Linear Algebra
- Society for Industrial and Applied Mathematics
, 1997
Cited by 416 (23 self)
We survey general techniques and open problems in numerical linear algebra on parallel architectures. We first discuss basic principles of parallel processing, describing the costs of basic operations on parallel machines, including general principles for constructing efficient algorithms. We illustrate these principles using current architectures and software systems, and by showing how one would implement matrix multiplication. Then, we present direct and iterative algorithms for solving linear systems of equations, linear least squares problems, the symmetric eigenvalue problem, the nonsymmetric eigenvalue problem, the singular value decomposition, and generalizations of these to two matrices. We consider dense, band and sparse matrices.
Parallelizing While Loops for Multiprocessor Systems
- IN PROCEEDINGS OF THE 9TH INTERNATIONAL PARALLEL PROCESSING SYMPOSIUM
, 1995
Cited by 29 (13 self)
Current parallelizing compilers treat while loops and do loops with conditional exits as sequential constructs because their iteration space is unknown. Motivated by the fact that these types of loops arise frequently in practice, we have developed techniques that can be used to automatically transform them for parallel execution. We succeed in parallelizing loops involving linked list traversals, something that has not been done before. This is an important problem since linked list traversals arise frequently in loops with irregular access patterns, such as sparse matrix computations. The methods can even be applied to loops whose data dependence relations cannot be analyzed at compile-time. We outline a cost/performance analysis that can be used to decide when the methods should be applied. Since, as we show, the expected speedups are significant, our conclusion is that they should almost always be applied, provided there is sufficient parallelism available in the original loop. We present experimental results on loops from the PERFECT Benchmarks and sparse matrix packages which substantiate our conclusion that these techniques can yield significant speedups.
Run-time Parallelization: A Framework for Parallel Computation
, 1995
Cited by 16 (8 self)
The goal of parallelizing, or restructuring, compilers is to detect and exploit parallelism in sequential programs written in conventional languages. Current parallelizing compilers do a reasonable job of extracting parallelism from programs with regular, statically analyzable access patterns. However, if the memory access pattern of the program is input-data dependent, then static data dependence analysis, and consequently parallelization, are impossible. Moreover, in this case the compiler cannot apply privatization and reduction parallelization, the transformations that have been proven to be the most effective in removing data dependences and increasing the amount of exploitable parallelism in the program. Typical examples of irregular, dynamic applications are complex simulations such as SPICE for circuit simulation, DYNA-3D for structural mechanics modeling, DMOL for quantum mechanical simulation of molecules, and CHARMM for molecular dynamics simulation of organic systems. Therefore, since irregular programs represent a large and important fraction of applications, an automatable framework for run-time parallelization is needed to complement existing and future static compiler techniques. In this thesis,
Theory, Techniques, And Experiments In Solving Recurrences In Computer Programs
, 1997
Cited by 14 (2 self)
... work. In the sixth chapter, we consider the application of these same techniques focused on obtaining parallelism in outer time-stepping loops. In the final chapter, we draw this work to a conclusion and discuss future directions in parallelizing compiler technology.
Acceleration of First and Higher Order Recurrences on Processors with Instruction Level Parallelism
- In Sixth International Workshop on Languages and Compilers for Parallel Computing
, 1993
Cited by 10 (2 self)
This report describes parallelization techniques for accelerating a broad class of recurrences on processors with instruction level parallelism. We introduce a new technique, called blocked back-substitution, which has lower operation count and higher performance than previous methods. The blocked back-substitution technique requires unrolling and non-symmetric optimization of innermost loop iterations. We present metrics to characterize the performance of software-pipelined loops and compare these metrics for a range of height reduction techniques and processor architectures.
Stability of parallel triangular system solvers
- SIAM J. Sci. Comput
, 1995
Solving Linear Recurrences with Loop Raking
, 1992
Cited by 6 (4 self)
We present a variation of the partition method for solving linear recurrences that is well suited to vector multiprocessors. The algorithm fully utilizes both vector and multiprocessor capabilities, and reduces the number of memory accesses and temporary memory requirements as compared to the more commonly used version of the partition method. Our variation uses a general loop restructuring technique called loop raking. We describe an implementation of this technique on the CRAY Y-MP C90, and present performance results for first- and second-order linear recurrences. On a single processor of the C90 our implementations are up to 7.3 times faster than the corresponding optimized library routines in SCILIB, an optimized mathematical library supplied by Cray Research. On 4 processors, we gain an additional speedup of at least 3.7.
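For readers unfamiliar with the partition method named in the abstract above, the following is a minimal sketch of the general idea for a first-order linear recurrence x[i] = a[i] * x[i-1] + b[i]. It is not the paper's loop-raking variant and makes no claim about the C90 implementation; the function names `solve_sequential` and `solve_partitioned` and the chunking scheme are illustrative assumptions. Phases 1 and 3 touch each chunk independently, so they are the parts that could run in parallel; here they are written as ordinary loops for clarity.

```python
def solve_sequential(a, b, x0):
    """Reference solution: x[i] = a[i] * x[i-1] + b[i], with x[-1] = x0."""
    xs = []
    x = x0
    for ai, bi in zip(a, b):
        x = ai * x + bi
        xs.append(x)
    return xs


def solve_partitioned(a, b, x0, num_chunks):
    """Partition-method sketch (hypothetical helper, not the paper's code)."""
    n = len(a)
    if n == 0:
        return []
    if num_chunks < 1:
        raise ValueError("num_chunks must be at least 1")
    size = -(-n // num_chunks)  # ceiling division
    bounds = [(lo, min(lo + size, n)) for lo in range(0, n, size)]

    # Phase 1 (independent per chunk): express every x[i] in the chunk as
    # c * x_in + d, where x_in is the unknown value entering the chunk.
    local = []
    for lo, hi in bounds:
        c, d = 1, 0
        coeffs = []
        for i in range(lo, hi):
            c, d = a[i] * c, a[i] * d + b[i]
            coeffs.append((c, d))
        local.append(coeffs)

    # Phase 2 (short sequential pass): propagate the value entering each
    # chunk using only the last coefficient pair of the previous chunk.
    inputs = []
    x_in = x0
    for coeffs in local:
        inputs.append(x_in)
        c_last, d_last = coeffs[-1]
        x_in = c_last * x_in + d_last

    # Phase 3 (independent per chunk): substitute the now-known inputs.
    xs = []
    for x_in, coeffs in zip(inputs, local):
        xs.extend(c * x_in + d for c, d in coeffs)
    return xs


a = [2, 3, 1, 0.5, 4]
b = [1, 0, 2, 1, 3]
print(solve_partitioned(a, b, 1, num_chunks=2))
print(solve_sequential(a, b, 1))
```

The trade-off visible in the sketch is the usual one for this family of methods: the partitioned version does more arithmetic per element than the sequential loop, in exchange for a sequential phase whose length is the number of chunks rather than the number of elements.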
Parallel Ocean Circulation Modeling on Cedar
, 1991
Cited by 5 (3 self)
The simulation of ocean circulation is a problem of great importance in the environmental sciences of today. Realistic simulations impose great demands on the existing computer systems, requiring the use of considerable computational power and storage capabilities. An implementation of an ocean general circulation model on the Cedar multicluster architecture is presented. This is based on the GFDL three-dimensional model, which was adapted for simulation of the Mediterranean. The model simulates the basic aspects of large-scale, baroclinic ocean circulation, including treatment of irregular bottom topography. The data and computational mapping strategies and the effect on the performance are discussed. The Cedar version of the code, using four clusters and 32 processors, has demonstrated significant speedup compared to a single cluster and compared to a single processor.
To my parents, Julio and Niomar De Rose
ACKNOWLEDGEMENTS This work was supported by the U.S. Department of E...
Developments and Trends in the Parallel Solution of Linear Systems
- Parallel Computing
, 1999
Cited by 5 (0 self)
In this review paper, we consider some important developments and trends in algorithm design for the solution of linear systems concentrating on aspects that involve the exploitation of parallelism. We briefly discuss the solution of dense linear systems, before studying the solution of sparse equations by direct and iterative methods. We consider preconditioning techniques for iterative solvers and discuss some of the present research issues in this field.
Keywords: linear systems, dense matrices, sparse matrices, tridiagonal systems, parallelism, direct methods, iterative methods, Krylov methods, preconditioning.
AMS(MOS) subject classifications: 65F05, 65F50.
CCLRC - Rutherford Appleton Laboratory, Oxfordshire, England and CERFACS, Toulouse,...
1 Introduction
Solution methods for systems of linear equations Ax = b, where A is a coefficient matrix of order n and x and b are n-vectors, are usually grouped into two distinct classes: direct methods and iterative methods.
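As a small illustration of the two classes named in the entry above, the sketch below solves the same system Ax = b once with a direct method (Gaussian elimination with partial pivoting) and once with an iterative method (Jacobi iteration). Both functions, their names, and the test matrix are illustrative assumptions and are not taken from the review paper; Jacobi is chosen only because each component update within a sweep is independent of the others, which makes the connection to parallelism easy to see. Jacobi is guaranteed to converge here because the example matrix is strictly diagonally dominant.

```python
def solve_direct(A, b):
    """Direct method: Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]  # augmented matrix [A | b]
    for k in range(n):
        # Pick the row with the largest pivot candidate in column k.
        p = max(range(k, n), key=lambda r: abs(M[r][k]))
        if M[p][k] == 0:
            raise ValueError("matrix is singular")
        M[k], M[p] = M[p], M[k]
        for r in range(k + 1, n):
            f = M[r][k] / M[k][k]
            for c in range(k, n + 1):
                M[r][c] -= f * M[k][c]
    # Back substitution on the upper triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def solve_jacobi(A, b, tol=1e-12, max_iter=10_000):
    """Iterative method: Jacobi sweeps until successive iterates agree."""
    n = len(A)
    x = [0.0] * n
    for _ in range(max_iter):
        # Every component of x_new depends only on the previous iterate,
        # so the n updates in one sweep are independent of each other.
        x_new = [
            (b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
            for i in range(n)
        ]
        if max(abs(u - v) for u, v in zip(x_new, x)) < tol:
            return x_new
        x = x_new
    raise RuntimeError("Jacobi iteration did not converge")


A = [[4.0, 1.0, 0.0], [1.0, 4.0, 1.0], [0.0, 1.0, 4.0]]
b = [5.0, 6.0, 5.0]
print(solve_direct(A, b))
print(solve_jacobi(A, b))
```

The direct solver finishes in a fixed number of operations, while the iterative solver trades that guarantee for cheap, independent updates whose total cost depends on how quickly the iteration converges.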
Computing Programs Containing Band Linear Recurrences on Vector Supercomputers
- IEEE Trans. on Parallel and Distributed Systems
, 1992
Cited by 4 (2 self)
Many large-scale scientific and engineering computations, e.g., some of the Grand Challenge problems [1], spend a major portion of execution time in their core loops computing band linear recurrences (BLR's). Conventional compiler parallelization techniques [4] cannot generate scalable parallel code for this type of computation because they respect loop-carried dependences (LCD's) in programs and there is a limited amount of parallelism in a BLR with respect to LCD's. For many applications, using library routines to replace the core BLR requires the separation of the BLR from its dependent computation, which usually incurs significant overhead. In this paper, we present a new scalable algorithm, called the Regular Schedule, for parallel evaluation of BLR's. We describe our implementation of the Regular Schedule and discuss how to obtain maximum memory throughput in implementing the schedule on vector supercomputers. We also illustrate our approach, based on our Regular Schedule, to parallel...

