Results 1–10 of 21
Parallel Numerical Linear Algebra
, 1993
"... We survey general techniques and open problems in numerical linear algebra on parallel architectures. We first discuss basic principles of parallel processing, describing the costs of basic operations on parallel machines, including general principles for constructing efficient algorithms. We illust ..."
Abstract

Cited by 773 (26 self)
We survey general techniques and open problems in numerical linear algebra on parallel architectures. We first discuss basic principles of parallel processing, describing the costs of basic operations on parallel machines, including general principles for constructing efficient algorithms. We illustrate these principles using current architectures and software systems, and by showing how one would implement matrix multiplication. Then, we present direct and iterative algorithms for solving linear systems of equations, linear least squares problems, the symmetric eigenvalue problem, the nonsymmetric eigenvalue problem, and the singular value decomposition. We consider dense, band and sparse matrices.
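The survey's running example of implementing matrix multiplication efficiently can be illustrated with a blocked (tiled) kernel. This is a minimal Python sketch of the general technique, not the survey's own code; the block size B and loop order are illustrative choices.

```python
# Blocked (tiled) matrix multiplication: each (ii, jj, kk) iteration works on
# small tiles so operands are reused while resident in fast memory -- the same
# principle that governs communication-efficient parallel implementations.
def blocked_matmul(A, B_mat, n, B=2):
    """Multiply two n x n matrices (lists of lists) tile by tile."""
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, B):
        for jj in range(0, n, B):
            for kk in range(0, n, B):
                # Update the C tile at (ii, jj) using one pair of A and B tiles.
                for i in range(ii, min(ii + B, n)):
                    for j in range(jj, min(jj + B, n)):
                        s = C[i][j]
                        for k in range(kk, min(kk + B, n)):
                            s += A[i][k] * B_mat[k][j]
                        C[i][j] = s
    return C
```

The same tiling structure underlies both cache blocking on a single processor and block distribution across parallel processors.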
Optimizing the performance of sparse matrix-vector multiplication
, 2000
"... Copyright 2000 by EunJin Im ..."
Scalable and Modular Algorithms for Floating-Point Matrix Multiplication on FPGAs
 In Proc. of The 18th International Parallel & Distributed Processing Symposium
, 2004
"... The abundant hardware resources on current FPGAs provide new opportunities to improve the performance of hardware implementations of scientific computations. In this paper, we propose two FPGAbased algorithms for floatingpoint matrix multiplication, a fundamental kernel in a number of scientific a ..."
Abstract

Cited by 59 (11 self)
The abundant hardware resources on current FPGAs provide new opportunities to improve the performance of hardware implementations of scientific computations. In this paper, we propose two FPGA-based algorithms for floating-point matrix multiplication, a fundamental kernel in a number of scientific applications. We analyze the design tradeoffs in implementing this kernel on FPGAs. Our algorithms employ a linear array architecture with a small control logic. This architecture effectively utilizes the hardware resources on the entire FPGA and reduces the routing complexity. The processing elements (PEs) used in our algorithms are modular so that floating-point units can be easily embedded into them. In our designs, the floating-point units are optimized to maximize the number of PEs integrated on the FPGA as well as the clock speed. Experimental results show that our algorithms achieve high clock speeds and provide good scalability. Our algorithms achieve superior sustained floating-point performance compared with existing FPGA-based implementations and state-of-the-art processors.
Towards a theory of cache-efficient algorithms
 In Proceedings of the Symposium on Discrete Algorithms (SODA)
, 2000
"... We present a model that enables us to analyze the running time of an algorithm on a computer with a memory hierarchy with limited associativity, in terms of various cache parameters. Our cache model, an extension of Aggarwal and Vitter’s I/O model, enables us to establish useful relationships betw ..."
Abstract

Cited by 56 (3 self)
We present a model that enables us to analyze the running time of an algorithm on a computer with a memory hierarchy with limited associativity, in terms of various cache parameters. Our cache model, an extension of Aggarwal and Vitter's I/O model, enables us to establish useful relationships between the cache complexity and the I/O complexity of computations. As a corollary, we obtain cache-efficient algorithms in the single-level cache model for fundamental problems like sorting, FFT, and an important subclass of permutations. We also analyze the average-case cache behavior of mergesort, show that ignoring associativity concerns could lead to inferior performance, and present supporting experimental evidence. We further extend our model to multiple levels of cache with limited associativity and present optimal algorithms for matrix transpose and sorting. Our techniques may be used for systematic ...
Optimizing Graph Algorithms for Improved Cache Performance
 In Proc. Int’l Parallel and Distributed Processing Symp. (IPDPS 2002), Fort Lauderdale, FL
, 2002
"... In this paper, we develop algorithmic optimizations to improve the cache performance of four fundamental graph algorithms. We present a cacheoblivious implementation of the FloydWarshall Algorithm for the fundamental graph problem of allpairs shortest paths by relaxing some dependencies in the it ..."
Abstract

Cited by 48 (0 self)
In this paper, we develop algorithmic optimizations to improve the cache performance of four fundamental graph algorithms. We present a cache-oblivious implementation of the Floyd-Warshall algorithm for the fundamental graph problem of all-pairs shortest paths by relaxing some dependencies in the iterative version. We show that this implementation achieves the lower bound on processor-memory traffic of Ω(N^3 / C), where N and C are the problem size and cache size, respectively. Experimental results show that this cache-oblivious implementation delivers more than a 6× improvement in real execution time over the iterative implementation with the usual row-major data layout, on three state-of-the-art architectures. Second, we address Dijkstra's algorithm for the single-source shortest paths problem and Prim's algorithm for minimum spanning tree. For these algorithms, we demonstrate up to a 2× improvement in real execution time by using a simple cache-friendly graph representation, namely adjacency arrays. Finally, we address the matching algorithm for bipartite graphs. We show performance improvements of 2×–3× in real execution time by making the algorithm initially work on subproblems to generate a suboptimal solution and then solving the whole problem using the suboptimal solution as a starting point.
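The adjacency-array representation credited above with the 2× speedup can be sketched as a CSR-like layout: all edges packed into flat arrays so a vertex's neighbors occupy contiguous memory. This is an illustrative Python sketch of the data structure, not the paper's code; the function names are hypothetical.

```python
# Build a cache-friendly adjacency array: offsets[u]..offsets[u+1] indexes
# the out-edges of vertex u in the flat targets/weights arrays, so scanning
# a neighbor list is a single contiguous sweep (unlike pointer-chasing lists).
def build_adjacency_arrays(n, edges):
    """edges: list of (u, v, w) directed edges; returns (offsets, targets, weights)."""
    deg = [0] * n
    for u, _, _ in edges:
        deg[u] += 1
    offsets = [0] * (n + 1)
    for u in range(n):
        offsets[u + 1] = offsets[u] + deg[u]
    targets = [0] * len(edges)
    weights = [0.0] * len(edges)
    fill = offsets[:-1].copy()          # next free slot per vertex
    for u, v, w in edges:
        targets[fill[u]] = v
        weights[fill[u]] = w
        fill[u] += 1
    return offsets, targets, weights

def neighbors(offsets, targets, weights, u):
    """Iterate over u's out-edges with one contiguous scan."""
    for i in range(offsets[u], offsets[u + 1]):
        yield targets[i], weights[i]
```

Dijkstra's or Prim's algorithm then iterates `neighbors(...)` in its relaxation loop, touching one contiguous cache-resident segment per vertex.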
Efficient Out-of-Core Algorithms for Linear Relaxation Using Blocking Covers (Extended Abstract)
"... When a numerical computation fails to fit in the primary memory of a serial or parallel computer, a socalled "outofcore" algorithm must be used which moves data between primary and secondary memories. In this paper, we study outofcore algorithms for sparse linear relaxation problems ..."
Abstract

Cited by 33 (6 self)
When a numerical computation fails to fit in the primary memory of a serial or parallel computer, a so-called "out-of-core" algorithm must be used which moves data between primary and secondary memories. In this paper, we study out-of-core algorithms for sparse linear relaxation problems in which each iteration of the algorithm updates the state of every vertex in a graph with a linear combination of the states of its neighbors. We give a general method that can save substantially on the I/O traffic for many problems. For example, our technique allows a computer with M words of primary memory to perform T = Ω(M^{1/5}) cycles of a multigrid algorithm for a two-dimensional elliptic solver over an n-point domain using only Θ(nT / M^{1/5}) I/O transfers, as compared with the naive algorithm which requires Ω(nT) I/Os. Our method depends on the existence of a "blocking" cover of the graph that underlies the linear relaxation. A blocking cover has the property that the subgraphs forming the cover have large diameters once a small number of vertices have been removed from the graph. The key idea in our method is to introduce a variable for each removed vertex for each time step of the algorithm. We maintain linear dependences among the removed vertices, thereby allowing each subgraph to be iteratively relaxed without external communication. We give a general theorem relating blocking covers to I/O-efficient relaxation schemes. We also give an automatic method for finding blocking cove...
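The basic operation the abstract describes, updating every vertex with a linear combination of its neighbors' states, can be modeled in a few lines. This is a minimal in-core sketch for intuition only; the uniform averaging coefficients and the example graph are assumptions, and the paper's contribution (blocking covers for out-of-core execution) is not reproduced here.

```python
# One synchronous linear-relaxation cycle: each vertex's new state is a
# linear combination (here, the plain average) of its neighbors' old states.
def relax_step(adj, state):
    """adj[u] lists the neighbors of vertex u; returns the next state vector."""
    new_state = []
    for u, nbrs in enumerate(adj):
        if nbrs:
            new_state.append(sum(state[v] for v in nbrs) / len(nbrs))
        else:
            new_state.append(state[u])   # isolated vertex keeps its state
    return new_state
```

An out-of-core solver must run many such cycles; the blocking-cover technique lets each subgraph advance several cycles between I/O transfers instead of streaming the whole graph once per cycle.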
High Performance Linear Algebra Operations on Reconfigurable Systems
, 2005
"... FieldProgrammable Gate Arrays (FPGAs) have become an attractive option for scientific computing. Several vendors have developed high performance reconfigurable systems which employ FPGAs for application acceleration. In this paper, we propose a BLAS (Basic Linear Algebra Subprograms) library for s ..."
Abstract

Cited by 30 (4 self)
Field-Programmable Gate Arrays (FPGAs) have become an attractive option for scientific computing. Several vendors have developed high performance reconfigurable systems which employ FPGAs for application acceleration. In this paper, we propose a BLAS (Basic Linear Algebra Subprograms) library for state-of-the-art reconfigurable systems. We study three data-intensive operations: dot product, matrix-vector multiply, and dense matrix multiply. The first two operations are I/O bound, and our designs efficiently utilize the available memory bandwidth in the systems. As these operations require accumulation of sequentially delivered floating-point values, we develop a high performance reduction circuit. This circuit uses only one floating-point adder and buffers of moderate size. For the matrix multiply operation, we propose a design which employs a linear array of FPGAs. This design exploits the memory hierarchy in the reconfigurable systems and has very low memory bandwidth requirements. To illustrate our ideas, we have implemented our designs for Level 2 and Level 3 BLAS on the Cray XD1.
Cache-Efficient Matrix Transposition
"... We investigate the memory system performance of several algorithms for transposing an N N matrix inplace, where N is large. Specifically, we investigate the relative contributions of the data cache, the translation lookaside buffer, register tiling, and the array layout function to the overall runn ..."
Abstract

Cited by 27 (1 self)
We investigate the memory system performance of several algorithms for transposing an N × N matrix in-place, where N is large. Specifically, we investigate the relative contributions of the data cache, the translation lookaside buffer, register tiling, and the array layout function to the overall running time of the algorithms. We use various memory models to capture and analyze the effect of various facets of cache memory architecture that guide the choice of a particular algorithm, and attempt to experimentally validate the predictions of the model. Our major conclusions are as follows: limited associativity in the mapping from main memory addresses to cache sets can significantly degrade running time; the limited number of TLB entries can easily lead to thrashing; the fanciest optimal algorithms are not competitive on real machines even at fairly large problem sizes unless cache miss penalties are quite high; low-level performance tuning "hacks", such as register tiling and array alignment, can significantly distort the effects of improved algorithms; and hierarchical nonlinear layouts are inherently superior to the standard canonical layouts (such as row- or column-major) for this problem.
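The simplest of the algorithms the abstract compares, a tiled in-place transpose over a row-major layout, can be sketched as follows. This is an illustrative Python model, not the paper's code; the tile size B stands in for the register/cache tiling discussed above.

```python
# Tiled in-place transpose of an N x N row-major matrix stored as a flat list.
# Swapping whole B x B tile pairs keeps both the read stream and the write
# stream inside a small cache (and TLB) footprint.
def transpose_tiled(a, n, B=2):
    for ii in range(0, n, B):
        for jj in range(ii, n, B):        # only tiles on or above the diagonal
            for i in range(ii, min(ii + B, n)):
                # Inside a diagonal tile, start past the diagonal so each
                # element pair is swapped exactly once.
                j0 = i + 1 if ii == jj else jj
                for j in range(j0, min(jj + B, n)):
                    a[i * n + j], a[j * n + i] = a[j * n + i], a[i * n + j]
    return a
```

Even this simple version touches two tiles with clashing strides at once, which is where the associativity and TLB effects measured in the paper come from.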
Design Tradeoffs for BLAS Operations on Reconfigurable Hardware
 In ICPP ’05: Proceedings of the 2005 International Conference on Parallel Processing
, 2005
"... Numerical linear algebra operations are key primitives in scientific computing. Performance optimizations of such operations have been extensively investigated and some basic operations have been implemented as software libraries. With the rapid advances in technology, hardware acceleration of linea ..."
Abstract

Cited by 13 (1 self)
Numerical linear algebra operations are key primitives in scientific computing. Performance optimizations of such operations have been extensively investigated, and some basic operations have been implemented as software libraries. With the rapid advances in technology, hardware acceleration of linear algebra applications using FPGAs (Field Programmable Gate Arrays) has become feasible. In this paper, we propose FPGA-based designs for several BLAS operations, including vector product, matrix-vector multiply, and matrix multiply. By identifying the design parameters for each BLAS operation, we analyze the design tradeoffs. In the implementations of the designs, the values of the design parameters are determined according to the hardware constraints, such as the available area, the size of on-chip memory, the external memory bandwidth, and the number of I/O pins. The proposed designs are implemented on a Xilinx Virtex-II Pro FPGA.
A Parallel Dynamic Programming Algorithm on a Multicore
, 2007
"... Dynamic programming is an efficient technique to solve combinatorial search and optimization problem. There have been many parallel dynamic programming algorithms. The purpose of this paper is to study a family of dynamic programming algorithm where data dependence appear between nonconsecutive sta ..."
Abstract

Cited by 12 (0 self)
Dynamic programming is an efficient technique for solving combinatorial search and optimization problems, and many parallel dynamic programming algorithms exist. The purpose of this paper is to study a family of dynamic programming algorithms where data dependences appear between non-consecutive stages; in other words, the data dependence is nonuniform. This kind of dynamic programming is typically called nonserial polyadic dynamic programming. Owing to the nonuniform data dependence, it is harder to optimize this problem for parallelism and locality on parallel architectures. In this paper, we address the challenge of exploiting fine-grained parallelism and locality of nonserial polyadic dynamic programming on a multicore architecture. We present a programming and execution model for multicore architectures with a memory hierarchy. In the framework of the new model, parallelism and locality benefit from a data dependence transformation. We propose a parallel pipelined algorithm for filling the dynamic programming matrix by decomposing the computation operators. The new parallel algorithm tolerates memory access latency using multithreading and is easily improved with a tiling technique. We formulate and analytically solve the optimization problem of determining the tile size that minimizes the total execution time...
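A textbook instance of the nonserial polyadic family the abstract describes is the matrix-chain ordering recurrence, where cost[i][j] depends on cost[i][k] and cost[k+1][j] for every split point k, i.e. on non-consecutive stages. The sequential sketch below illustrates that dependence pattern and the diagonal (wavefront) fill order that the paper's pipelined algorithm parallelizes and tiles; it is not the paper's own code.

```python
# Nonserial polyadic DP example: minimum scalar multiplications to evaluate
# a chain of matrices, where dims[i] x dims[i+1] is the shape of matrix i.
def matrix_chain_cost(dims):
    n = len(dims) - 1
    cost = [[0] * n for _ in range(n)]
    # Fill diagonal by diagonal (by chain length): every entry on one
    # diagonal depends on entries from ALL shorter diagonals, not just the
    # previous one -- the nonuniform dependence that complicates tiling.
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length - 1
            cost[i][j] = min(
                cost[i][k] + cost[k + 1][j] + dims[i] * dims[k + 1] * dims[j + 1]
                for k in range(i, j)
            )
    return cost[0][n - 1]
```

Entries on the same diagonal are independent of each other, which is what makes the wavefront amenable to pipelined multicore execution.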