Results 1–10 of 16
Optimizing the performance of sparse matrix-vector multiplication
, 2000
"... Copyright 2000 by EunJin Im ..."
Scalable and Modular Algorithms for Floating-Point Matrix Multiplication on FPGAs
 In Proc. of The 18th International Parallel & Distributed Processing Symposium
, 2004
"... The abundant hardware resources on current FPGAs provide new opportunities to improve the performance of hardware implementations of scientific computations. In this paper, we propose two FPGAbased algorithms for floatingpoint matrix multiplication, a fundamental kernel in a number of scientific a ..."
Abstract

Cited by 48 (10 self)
 Add to MetaCart
The abundant hardware resources on current FPGAs provide new opportunities to improve the performance of hardware implementations of scientific computations. In this paper, we propose two FPGA-based algorithms for floating-point matrix multiplication, a fundamental kernel in a number of scientific applications. We analyze the design tradeoffs in implementing this kernel on FPGAs. Our algorithms employ a linear array architecture with small control logic. This architecture effectively utilizes the hardware resources on the entire FPGA and reduces the routing complexity. The processing elements (PEs) used in our algorithms are modular, so that floating-point units can be easily embedded into them. In our designs, the floating-point units are optimized to maximize the number of PEs integrated on the FPGA as well as the clock speed. Experimental results show that our algorithms achieve high clock speeds and provide good scalability. Our algorithms achieve superior sustained floating-point performance compared with existing FPGA-based implementations and state-of-the-art processors.
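As a rough illustration of the linear-array idea (a generic C sketch, not the authors' design): conceptually, PE j holds column j of B and performs one multiply-accumulate per clock step as the elements of A stream past it, so the innermost loop below plays the role of the PE array.

```c
#include <stdio.h>
#define N 4

int main(void) {
    double A[N][N], B[N][N], C[N][N] = {{0}};
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            A[i][j] = i + 1;      /* row i of A is all (i+1)s */
            B[i][j] = (i == j);   /* B = identity for an easy check */
        }
    /* One "clock step" per element of A: a[i][k] travels along the
       array, and PE j folds it into its running sum for c[i][j]. */
    for (int i = 0; i < N; i++)
        for (int k = 0; k < N; k++)
            for (int j = 0; j < N; j++)   /* the j loop = the PE array */
                C[i][j] += A[i][k] * B[k][j];
    printf("C[2][2] = %g (expect 3)\n", C[2][2]);
    return 0;
}
```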
Towards a theory of cache-efficient algorithms
In Proceedings of the Symposium on Discrete Algorithms
, 2000
"... We present a model that enables us to analyze the running time of an algorithm on a computer with a memory hierarchy with limited associativity, in terms of various cache parameters. Our cache model, an extension of Aggarwal and Vitter’s I/O model, enables us to establish useful relationships betw ..."
Abstract

Cited by 47 (3 self)
 Add to MetaCart
We present a model that enables us to analyze the running time of an algorithm on a computer with a memory hierarchy with limited associativity, in terms of various cache parameters. Our cache model, an extension of Aggarwal and Vitter’s I/O model, enables us to establish useful relationships between the cache complexity and the I/O complexity of computations. As a corollary, we obtain cache-efficient algorithms in the single-level cache model for fundamental problems like sorting, FFT, and an important subclass of permutations. We also analyze the average-case cache behavior of mergesort, show that ignoring associativity concerns could lead to inferior performance, and present supporting experimental evidence. We further extend our model to multiple levels of cache with limited associativity and present optimal algorithms for matrix transpose and sorting. Our techniques may be used for systematic ...
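For context, the Aggarwal–Vitter I/O model that this cache model extends charges one unit per block transfer; its standard scan and sort bounds, with N the input size, M the internal memory size, and B the block size, are:

```latex
\mathrm{Scan}(N) = \Theta\!\left(\frac{N}{B}\right), \qquad
\mathrm{Sort}(N) = \Theta\!\left(\frac{N}{B}\,\log_{M/B}\frac{N}{B}\right)
```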
Optimizing Graph Algorithms for Improved Cache Performance
 In Proc. Int’l Parallel and Distributed Processing Symp. (IPDPS 2002), Fort Lauderdale, FL
, 2002
"... In this paper, we develop algorithmic optimizations to improve the cache performance of four fundamental graph algorithms. We present a cacheoblivious implementation of the FloydWarshall Algorithm for the fundamental graph problem of allpairs shortest paths by relaxing some dependencies in the it ..."
Abstract

Cited by 28 (0 self)
 Add to MetaCart
In this paper, we develop algorithmic optimizations to improve the cache performance of four fundamental graph algorithms. We present a cache-oblivious implementation of the Floyd-Warshall algorithm for the fundamental graph problem of all-pairs shortest paths by relaxing some dependencies in the iterative version. We show that this implementation achieves the lower bound on processor-memory traffic of Ω(N^3 / √C), where N and C are the problem size and cache size, respectively. Experimental results show that this cache-oblivious implementation achieves more than a 6× improvement in real execution time over the iterative implementation with the usual row-major data layout, on three state-of-the-art architectures. Secondly, we address Dijkstra's algorithm for the single-source shortest paths problem and Prim's algorithm for Minimum Spanning Tree. For these algorithms, we demonstrate up to a 2× improvement in real execution time by using a simple cache-friendly graph representation, namely adjacency arrays. Finally, we address the matching algorithm for bipartite graphs. We show performance improvements of 2×–3× in real execution time by making the algorithm initially work on subproblems to generate a suboptimal solution and then solving the whole problem using the suboptimal solution as a starting point.
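A generic reconstruction (not the authors' code) of a dependence-relaxed, recursive Floyd-Warshall: fwr(X, U, V) applies x[i][j] = min(x[i][j], u[i][k] + v[k][j]) over quadrants, recursing until blocks fit in cache, and the top-level call is fwr(D, D, D).

```c
#include <stdio.h>

#define LD 8     /* full matrix dimension, a power of two (toy size) */
#define BASE 2   /* base-case block size */
#define INF 1e9

static double d[LD][LD];

static double min2(double a, double b) { return a < b ? a : b; }

/* X, U, V point at the top-left corners of n x n blocks inside d;
   the update is x[i][j] = min(x[i][j], u[i][k] + v[k][j]). */
static void fwr(double *X, double *U, double *V, int n) {
    if (n <= BASE) {                       /* iterative k-i-j kernel */
        for (int k = 0; k < n; k++)
            for (int i = 0; i < n; i++)
                for (int j = 0; j < n; j++)
                    X[i*LD + j] = min2(X[i*LD + j], U[i*LD + k] + V[k*LD + j]);
        return;
    }
    int h = n / 2;                         /* recurse on quadrants */
    double *X11 = X, *X12 = X + h, *X21 = X + h*LD, *X22 = X + h*LD + h;
    double *U11 = U, *U12 = U + h, *U21 = U + h*LD, *U22 = U + h*LD + h;
    double *V11 = V, *V12 = V + h, *V21 = V + h*LD, *V22 = V + h*LD + h;
    fwr(X11, U11, V11, h); fwr(X12, U11, V12, h);
    fwr(X21, U21, V11, h); fwr(X22, U21, V12, h);
    fwr(X22, U22, V22, h); fwr(X21, U22, V21, h);
    fwr(X12, U12, V22, h); fwr(X11, U12, V21, h);
}

int main(void) {
    for (int i = 0; i < LD; i++)           /* directed ring, weight-1 edges */
        for (int j = 0; j < LD; j++)
            d[i][j] = (i == j) ? 0 : (j == (i + 1) % LD) ? 1 : INF;
    fwr(&d[0][0], &d[0][0], &d[0][0], LD);
    printf("dist(0,%d) = %g (expect %d)\n", LD - 1, d[0][LD - 1], LD - 1);
    return 0;
}
```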
Cache-Efficient Matrix Transposition
"... We investigate the memory system performance of several algorithms for transposing an N N matrix inplace, where N is large. Specifically, we investigate the relative contributions of the data cache, the translation lookaside buffer, register tiling, and the array layout function to the overall runn ..."
Abstract

Cited by 23 (1 self)
 Add to MetaCart
We investigate the memory system performance of several algorithms for transposing an N × N matrix in-place, where N is large. Specifically, we investigate the relative contributions of the data cache, the translation lookaside buffer, register tiling, and the array layout function to the overall running time of the algorithms. We use various memory models to capture and analyze the effects of the facets of cache memory architecture that guide the choice of a particular algorithm, and attempt to experimentally validate the predictions of the models. Our major conclusions are as follows: limited associativity in the mapping from main memory addresses to cache sets can significantly degrade running time; the limited number of TLB entries can easily lead to thrashing; the fanciest optimal algorithms are not competitive on real machines even at fairly large problem sizes unless cache miss penalties are quite high; low-level performance tuning “hacks”, such as register tiling and array alignment, can significantly distort the effects of improved algorithms; and hierarchical nonlinear layouts are inherently superior to the standard canonical layouts (such as row- or column-major) for this problem.
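The tiling idea at the heart of such studies is easy to state in code. The following blocked in-place transpose is a generic illustration (not the paper's code): each pair of TILE × TILE tiles is swapped while both are cache-resident.

```c
#include <stdio.h>

#define N 1024    /* assume TILE divides N */
#define TILE 32

static double a[N][N];

static void transpose_blocked(void) {
    for (int ii = 0; ii < N; ii += TILE)
        for (int jj = ii; jj < N; jj += TILE)        /* tiles on/above diagonal */
            for (int i = ii; i < ii + TILE; i++) {
                int j0 = (ii == jj) ? i + 1 : jj;    /* skip the diagonal itself */
                for (int j = j0; j < jj + TILE; j++) {
                    double t = a[i][j];
                    a[i][j] = a[j][i];
                    a[j][i] = t;
                }
            }
}

int main(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a[i][j] = i * (double)N + j;
    transpose_blocked();
    printf("a[3][5] = %g (expect %g)\n", a[3][5], 5.0 * N + 3);
    return 0;
}
```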
High Performance Linear Algebra Operations on Reconfigurable Systems
, 2005
"... FieldProgrammable Gate Arrays (FPGAs) have become an attractive option for scientific computing. Several vendors have developed high performance reconfigurable systems which employ FPGAs for application acceleration. In this paper, we propose a BLAS (Basic Linear Algebra Subprograms) library for s ..."
Abstract

Cited by 22 (4 self)
 Add to MetaCart
Field-Programmable Gate Arrays (FPGAs) have become an attractive option for scientific computing. Several vendors have developed high-performance reconfigurable systems which employ FPGAs for application acceleration. In this paper, we propose a BLAS (Basic Linear Algebra Subprograms) library for state-of-the-art reconfigurable systems. We study three data-intensive operations: dot product, matrix-vector multiply, and dense matrix multiply. The first two operations are I/O bound, and our designs efficiently utilize the available memory bandwidth in the systems. As these operations require accumulation of sequentially delivered floating-point values, we develop a high-performance reduction circuit. This circuit uses only one floating-point adder and buffers of moderate size. For the matrix multiply operation, we propose a design which employs a linear array of FPGAs. This design exploits the memory hierarchy in reconfigurable systems and has very low memory bandwidth requirements. To illustrate our ideas, we have implemented our designs for Level 2 and Level 3 BLAS on the Cray XD1.
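A software analogue of one schedule such a reduction circuit could implement (a sketch under assumed behavior, not the paper's design): binary-counter pairwise reduction, which buffers only logarithmically many partial sums and combines at most a chain of pairs per incoming value.

```c
#include <stdio.h>

#define LEVELS 64   /* enough for any realistic stream length */

int main(void) {
    double partial[LEVELS];
    int occupied[LEVELS] = {0};

    for (int i = 0; i < 1000; i++) {   /* one value arrives per "cycle" */
        double v = 1.0 / (i + 1);
        int lvl = 0;
        while (occupied[lvl]) {        /* propagate like a binary counter: */
            v += partial[lvl];         /* each level holds at most one sum */
            occupied[lvl++] = 0;
        }
        partial[lvl] = v;
        occupied[lvl] = 1;
    }
    double sum = 0.0;                  /* drain leftover partial sums */
    for (int l = 0; l < LEVELS; l++)
        if (occupied[l]) sum += partial[l];
    printf("sum = %.10f\n", sum);      /* the 1000th harmonic number */
    return 0;
}
```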
Design Tradeoffs for BLAS Operations on Reconfigurable Hardware
 In ICPP ’05: Proceedings of the 2005 International Conference on Parallel Processing
, 2005
"... Numerical linear algebra operations are key primitives in scientific computing. Performance optimizations of such operations have been extensively investigated and some basic operations have been implemented as software libraries. With the rapid advances in technology, hardware acceleration of linea ..."
Abstract

Cited by 9 (1 self)
 Add to MetaCart
Numerical linear algebra operations are key primitives in scientific computing. Performance optimizations of such operations have been extensively investigated, and some basic operations have been implemented as software libraries. With the rapid advances in technology, hardware acceleration of linear algebra applications using FPGAs (Field-Programmable Gate Arrays) has become feasible. In this paper, we propose FPGA-based designs for several BLAS operations, including vector product, matrix-vector multiply, and matrix multiply. By identifying the design parameters for each BLAS operation, we analyze the design tradeoffs. In the implementations of the designs, the values of the design parameters are determined according to the hardware constraints, such as the available area, the size of on-chip memory, the external memory bandwidth, and the number of I/O pins. The proposed designs are implemented on a Xilinx Virtex-II Pro FPGA.
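As a toy illustration of this kind of parameter selection (all numbers below are hypothetical placeholders, not the paper's measurements), the number of PEs that can be instantiated is the minimum of an area bound and a memory bandwidth bound:

```c
#include <stdio.h>

int main(void) {
    /* All figures below are made-up placeholders, not measured values. */
    double area_slices   = 33088;       /* slices available on the FPGA    */
    double slices_per_pe = 1200;        /* area cost of one FP MAC PE      */
    double mem_bw        = 3.2e9;       /* external memory bytes/second    */
    double bytes_per_pe  = 8.0 * 100e6; /* bytes/s one PE consumes @100MHz */

    int pe_area = (int)(area_slices / slices_per_pe);
    int pe_bw   = (int)(mem_bw / bytes_per_pe);
    int pes     = pe_area < pe_bw ? pe_area : pe_bw;
    printf("area-limited: %d PEs, bandwidth-limited: %d PEs -> use %d\n",
           pe_area, pe_bw, pes);
    return 0;
}
```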
Cache-Friendly Implementations of Transitive Closure
, 2001
"... In this paper we show cachefriendly implementations of the FloydWarshall algorithm for the AllPairs ShortestPath problem. We first compare the best commercial compiler optimizations available with standard cachefriendly optimizations and a simple improvement involving a block layout, which reduc ..."
Abstract

Cited by 8 (3 self)
 Add to MetaCart
In this paper we show cache-friendly implementations of the Floyd-Warshall algorithm for the All-Pairs Shortest-Path problem. We first compare the best commercial compiler optimizations available with standard cache-friendly optimizations and a simple improvement involving a block layout, which reduces TLB misses. We show approximately 15% improvements using these optimizations. We also develop a general representation, the Unidirectional Space Time Representation, which can be used to generate cache-friendly implementations for a large class of algorithms. We show analytically and experimentally that this representation can be used to minimize level-1 and level-2 cache misses and TLB misses and therefore exhibits the best overall performance. Using this representation, we show a 2× improvement in performance with respect to the compiler-optimized implementation. Experiments were conducted on Pentium III, Alpha, and MIPS R12000 machines using problem sizes between 1024 and 2048 vertices. We used the SimpleScalar simulator to demonstrate improved cache performance.
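For reference, a block data layout of the kind described here can be expressed as a single index function (a generic sketch, not the paper's exact layout): each BLK × BLK block is stored contiguously, so a blocked sweep touches far fewer distinct pages, which is what cuts TLB misses.

```c
#include <stdio.h>
#include <stddef.h>

#define BLK 64   /* block side; assumed to divide n */

/* Index of element (i, j) of an n x n matrix stored block-by-block,
   with each BLK x BLK block contiguous in memory. */
static size_t block_index(size_t i, size_t j, size_t n) {
    size_t bi = i / BLK, bj = j / BLK;   /* which block          */
    size_t oi = i % BLK, oj = j % BLK;   /* offset inside block  */
    return ((bi * (n / BLK) + bj) * BLK + oi) * BLK + oj;
}

int main(void) {
    size_t n = 2048;
    /* (1,0) is only BLK away from (0,0), not n away as in row-major. */
    printf("(0,0)->%zu  (0,1)->%zu  (1,0)->%zu\n",
           block_index(0, 0, n), block_index(0, 1, n), block_index(1, 0, n));
    return 0;
}
```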
Optimum Binary Search Trees on the Hierarchical Memory Model
, 2001
"... The Hierarchical Memory Model (HMM) of computation is similar to the standard Random Access Machine (RAM) model except that the HMM has a nonuniform memory organized in a hierarchy of levels numbered 1 through h. The cost of accessing a memory location increases with the level number, and accesses ..."
Abstract

Cited by 8 (1 self)
 Add to MetaCart
The Hierarchical Memory Model (HMM) of computation is similar to the standard Random Access Machine (RAM) model, except that the HMM has a nonuniform memory organized in a hierarchy of levels numbered 1 through h. The cost of accessing a memory location increases with the level number, and accesses to memory locations belonging to the same level cost the same. Formally, the cost of a single access to the memory location at address a is given by f(a), where f : ℕ → ℕ is the memory cost function, and the h distinct values of f model the different levels of the memory hierarchy. We study the problem of constructing and storing a binary search tree (BST) of minimum cost, over a set of keys, with probabilities for successful and unsuccessful searches, on the HMM with an arbitrary number of memory levels, and for the special case h = 2. While the problem of constructing optimum binary search trees has been well studied for the standard RAM model, the additional parameter for the HMM inc...
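One natural way to write the objective described here (the notation is illustrative, not necessarily the paper's): with p_i the probability of a successful search for key i, q_j the probability of an unsuccessful search ending at external node e_j, path(·) the root-to-node search path, a(u) the address at which node u is stored, and f the HMM cost function, the goal is to choose the tree shape and node addresses minimizing the expected search cost

```latex
\mathrm{cost}(T) \;=\; \sum_{i} p_i \sum_{u \in \mathrm{path}(v_i)} f\bigl(a(u)\bigr)
\;+\; \sum_{j} q_j \sum_{u \in \mathrm{path}(e_j)} f\bigl(a(u)\bigr)
```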
A Parallel Dynamic Programming Algorithm on a Multicore
, 2007
"... Dynamic programming is an efficient technique to solve combinatorial search and optimization problem. There have been many parallel dynamic programming algorithms. The purpose of this paper is to study a family of dynamic programming algorithm where data dependence appear between nonconsecutive sta ..."
Abstract

Cited by 7 (0 self)
 Add to MetaCart
Dynamic programming is an efficient technique to solve combinatorial search and optimization problems. There have been many parallel dynamic programming algorithms. The purpose of this paper is to study a family of dynamic programming algorithms where data dependences appear between non-consecutive stages; in other words, the data dependence is nonuniform. This kind of dynamic programming is typically called nonserial polyadic dynamic programming. Owing to the nonuniform data dependence, it is harder to optimize this problem for parallelism and locality on parallel architectures. In this paper, we address the challenge of exploiting fine-grain parallelism and locality of nonserial polyadic dynamic programming on a multicore architecture. We present a programming and execution model for multicore architectures with a memory hierarchy. In the framework of the new model, the parallelism and locality benefit from a data dependence transformation. We propose a parallel pipelined algorithm for filling the dynamic programming matrix by decomposing the computation operators. The new parallel algorithm tolerates memory access latency using multithreading and is easily improved with the tiling technique. We formulate and analytically solve the optimization problem of determining the tile size that minimizes the total execution time. The ...
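A minimal serial instance (an example chosen here, not from the paper) of a nonserial polyadic recurrence has the matrix-chain / RNA-folding shape: c[i][j] depends on every split point k strictly between i and j, so dependences cross non-consecutive diagonals, which is exactly what complicates tiling and pipelining.

```c
#include <stdio.h>

#define N 8
#define INF 1e30

int main(void) {
    double c[N][N] = {{0}};                   /* c[i][i+1] = 0: base case  */
    for (int len = 2; len < N; len++)         /* stage = diagonal number   */
        for (int i = 0; i + len < N; i++) {
            int j = i + len;
            double best = INF;
            for (int k = i + 1; k < j; k++) { /* depends on ALL earlier    */
                double v = c[i][k] + c[k][j]; /* diagonals, not just the   */
                if (v < best) best = v;       /* previous one              */
            }
            c[i][j] = best + (double)(i + j); /* w(i,j): a toy cost term   */
        }
    printf("c[0][%d] = %g\n", N - 1, c[0][N - 1]);
    return 0;
}
```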