The Potential of the Cell Processor for Scientific Computing
 CF'06
, 2006
Abstract

Cited by 95 (7 self)
The slowing pace of commodity microprocessor performance improvements combined with ever-increasing chip power demands has become of utmost concern to computational scientists. As a result, the high performance computing community is examining alternative architectures that address the limitations of modern cache-based designs. In this work, we examine the potential of using the forthcoming STI Cell processor as a building block for future high-end computing systems. Our work contains several novel contributions. First, we introduce a performance model for Cell and apply it to several key scientific computing kernels: dense matrix multiply, sparse matrix vector multiply, stencil computations, and 1D/2D FFTs. The difficulty of programming Cell, which requires assembly-level intrinsics for the best performance, makes this model useful as an initial step in algorithm design and evaluation. Next, we validate the accuracy of our model by comparing results against published hardware results, as well as our own implementations on the Cell full system simulator. Additionally, we compare Cell performance to benchmarks run on leading superscalar (AMD Opteron), VLIW (Intel Itanium2), and vector (Cray X1E) architectures. Our work also explores several different mappings of the kernels and demonstrates a simple and effective programming model for Cell’s unique architecture. Finally, we propose modest microarchitectural modifications that could significantly increase the efficiency of double-precision calculations. Overall results demonstrate the tremendous potential of the Cell architecture for scientific computations in terms of both raw performance and power efficiency.
SUMMA: Scalable universal matrix multiplication algorithm
, 1997
Abstract

Cited by 95 (4 self)
In this paper, we give a straightforward, highly efficient, scalable implementation of common matrix multiplication operations. The algorithms are much simpler than previously published methods, yield better performance, and require less work space. MPI implementations are given, as are performance results on the Intel Paragon system.
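The core of SUMMA is a sequence of blocked outer-product (rank-kb) updates, whose panels of A and B are broadcast along process rows and columns. A serial sketch of that formulation (the block size `kb` and function name are illustrative; the MPI broadcasts are omitted) might look like:

```python
import numpy as np

def summa_like(A, B, kb=2):
    """Serial sketch of SUMMA's blocked outer-product formulation.

    C is accumulated as a sum of rank-kb updates; in the parallel
    algorithm, each panel A[:, k0:k0+kb] is broadcast along process
    rows and B[k0:k0+kb, :] along process columns before the update.
    """
    m, k = A.shape
    k2, n = B.shape
    assert k == k2, "inner dimensions must match"
    C = np.zeros((m, n))
    for k0 in range(0, k, kb):
        # Rank-kb update: the step a real SUMMA performs after its broadcasts.
        C += A[:, k0:k0 + kb] @ B[k0:k0 + kb, :]
    return C
```

Summing the per-panel updates reproduces the full product, which is what lets SUMMA overlap communication of the next panel with computation on the current one.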
Communication Lower Bounds for Distributed-Memory Matrix Multiplication
, 2004
Abstract

Cited by 85 (1 self)
this paper. More specifically, we use the definitions of [10]: Θ(g(n)) is the set of functions f(n) such that there exist positive constants c1, c2, and n0 such that 0 ≤ c1·g(n) ≤ f(n) ≤ c2·g(n) for all n ≥ n0; O(g(n)) is defined similarly using the weaker condition 0 ≤ f(n) ≤ c2·g(n); Ω(g(n)) is defined with the condition 0 ≤ c1·g(n) ≤ f(n). The set o(g(n)) consists of functions f(n) such that for any c2 > 0 there exists a constant n0 > 0 such that 0 ≤ f(n) < c2·g(n) for all n ≥ n0.
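The Θ definition above can be spot-checked numerically for candidate witness constants; the helper and the example function here are illustrative, and a finite scan is of course a sanity check, not a proof:

```python
def theta_witness(f, g, c1, c2, n0, n_max=10_000):
    """Finite spot-check of the Theta(g(n)) definition:
    0 <= c1*g(n) <= f(n) <= c2*g(n) for all n in [n0, n_max)."""
    return all(0 <= c1 * g(n) <= f(n) <= c2 * g(n) for n in range(n0, n_max))

# Example: f(n) = 3n^2 + 10n is Theta(n^2); the constants c1 = 3, c2 = 4,
# n0 = 10 work as witnesses, since 10n <= n^2 whenever n >= 10.
```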
Scalable and Modular Algorithms for Floating-Point Matrix Multiplication on FPGAs
 In Proc. of the 18th International Parallel & Distributed Processing Symposium
, 2004
Abstract

Cited by 59 (11 self)
The abundant hardware resources on current FPGAs provide new opportunities to improve the performance of hardware implementations of scientific computations. In this paper, we propose two FPGA-based algorithms for floating-point matrix multiplication, a fundamental kernel in a number of scientific applications. We analyze the design tradeoffs in implementing this kernel on FPGAs. Our algorithms employ a linear array architecture with a small control logic. This architecture effectively utilizes the hardware resources on the entire FPGA and reduces the routing complexity. The processing elements (PEs) used in our algorithms are modular so that floating-point units can be easily embedded into them. In our designs, the floating-point units are optimized to maximize the number of PEs integrated on the FPGA as well as the clock speed. Experimental results show that our algorithms achieve high clock speeds and provide good scalability. Our algorithms achieve superior sustained floating-point performance compared with existing FPGA-based implementations and state-of-the-art processors.
Synthesis of High-Performance Parallel Programs for a Class of Ab Initio Quantum Chemistry Models
 Proceedings of the IEEE
, 2005
Communication-optimal parallel 2.5D matrix multiplication and LU factorization algorithms
Abstract

Cited by 42 (21 self)
One can use extra memory to parallelize matrix multiplication by storing p^(1/3) redundant copies of the input matrices on p processors in order to do asymptotically less communication than Cannon’s algorithm [2], and be faster in practice [1]. We call this algorithm “3D” because it arranges the p processors in a 3D array, and Cannon’s algorithm “2D” because it stores a single copy of the matrices on a 2D array of processors. We generalize these 2D and 3D algorithms by introducing a new class of “2.5D algorithms”. For matrix multiplication, we can take advantage of any amount of extra memory to store c copies of the data, for any c ∈ {1, 2, ..., ⌊p^(1/3)⌋}, to reduce the bandwidth cost of Cannon’s algorithm by a factor of c^(1/2) and the latency cost by a factor of c^(3/2). We also show that these costs reach the lower bounds [13, 3], modulo polylog(p) factors. We similarly generalize LU decomposition to 2.5D and 3D, including communication-avoiding pivoting, a stable alternative to partial pivoting [7]. We prove a novel lower bound on the latency cost of 2.5D and 3D LU factorization, showing that while c copies of the data can also reduce the bandwidth by a factor of c^(1/2), the latency must increase by a factor of c^(1/2), so that the 2D LU algorithm (c = 1) in fact minimizes latency. Preliminary results of 2.5D matrix multiplication on a Cray XT4 machine also demonstrate a performance gain of up to 3X with respect to Cannon’s algorithm. Careful choice of c also yields up to a 2.4X speedup over 3D matrix multiplication, due to a better balance between communication costs.
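Taking the abstract at its word, the c^(1/2) bandwidth and c^(3/2) latency factors can be folded into a toy alpha-beta cost model relative to Cannon’s 2D algorithm (c = 1). The per-message cost `alpha`, per-word cost `beta`, and the function itself are hypothetical, for illustration only:

```python
import math

def comm_cost_25d(n, p, c, alpha=1.0, beta=1.0):
    """Toy per-processor communication cost for 2.5D matrix multiply.

    Starts from Cannon's 2D costs (bandwidth ~ n^2/sqrt(p) words,
    latency ~ sqrt(p) messages) and applies the reduction factors
    stated in the abstract: c replicas divide the bandwidth term by
    c^(1/2) and the latency term by c^(3/2).
    """
    words_2d = n * n / math.sqrt(p)   # Cannon's bandwidth cost (words moved)
    msgs_2d = math.sqrt(p)            # Cannon's latency cost (messages sent)
    return alpha * msgs_2d / c ** 1.5 + beta * words_2d / math.sqrt(c)
```

Sweeping c from 1 to ⌊p^(1/3)⌋ in this model shows the monotone drop in communication cost that extra replication buys for matrix multiply (the LU latency story in the abstract is the opposite, which is why c = 1 minimizes LU latency).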
Communication-optimal parallel algorithm for Strassen’s matrix multiplication
 In Proceedings of the 24th ACM Symposium on Parallelism in Algorithms and Architectures, SPAA ’12
, 2012
Abstract

Cited by 32 (21 self)
Parallel matrix multiplication is one of the most studied fundamental problems in distributed and high performance computing. We obtain a new parallel algorithm that is based on Strassen’s fast matrix multiplication and minimizes communication. The algorithm outperforms all known parallel matrix multiplication algorithms, classical and Strassen-based, both asymptotically and in practice. A critical bottleneck in parallelizing Strassen’s algorithm is the communication between the processors. Ballard, Demmel, Holtz, and Schwartz (SPAA’11) prove lower bounds on these communication costs, using expansion properties of the underlying computation graph. Our algorithm matches these lower bounds, and so is communication-optimal. It exhibits perfect strong scaling within the maximum possible range.
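The sequential kernel that such a parallel algorithm distributes is Strassen’s 7-multiply recursion, which replaces the 8 half-size products of the classical blocked formula and yields O(n^log2(7)) work. A minimal serial sketch (assuming n is a power of two; the `cutoff` for falling back to a direct product is an illustrative tuning knob):

```python
import numpy as np

def strassen(A, B, cutoff=64):
    """Strassen's recursion for square matrices with power-of-two n.

    Seven recursive half-size multiplies (M1..M7) replace the eight of
    the classical algorithm; below `cutoff` we fall back to a direct
    product, as practical implementations do.
    """
    n = A.shape[0]
    if n <= cutoff:
        return A @ B
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    M1 = strassen(A11 + A22, B11 + B22, cutoff)
    M2 = strassen(A21 + A22, B11, cutoff)
    M3 = strassen(A11, B12 - B22, cutoff)
    M4 = strassen(A22, B21 - B11, cutoff)
    M5 = strassen(A11 + A12, B22, cutoff)
    M6 = strassen(A21 - A11, B11 + B12, cutoff)
    M7 = strassen(A12 - A22, B21 + B22, cutoff)
    C = np.empty((n, n))
    C[:h, :h] = M1 + M4 - M5 + M7
    C[:h, h:] = M3 + M5
    C[h:, :h] = M2 + M4
    C[h:, h:] = M1 - M2 + M3 + M6
    return C
```

The additions and subtractions forming the M-terms are exactly the data movement that dominates a distributed implementation, which is why the communication lower bounds discussed above bind on the recursion's expansion properties.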
Graph Expansion and Communication Costs of Fast Matrix Multiplication
Abstract

Cited by 32 (18 self)
The communication cost of algorithms (also known as I/O complexity) is shown to be closely related to the expansion properties of the corresponding computation graphs. We demonstrate this on Strassen’s and other fast matrix multiplication algorithms, and obtain the first lower bounds on their communication costs. For sequential algorithms these bounds are attainable and so optimal.