Results 1 - 8 of 8
Recursive Array Layouts and Fast Parallel Matrix Multiplication
 In Proceedings of Eleventh Annual ACM Symposium on Parallel Algorithms and Architectures
, 1999
Abstract

Cited by 48 (4 self)
Matrix multiplication is an important kernel in linear algebra algorithms, and the performance of both serial and parallel implementations is highly dependent on the memory system behavior. Unfortunately, due to false sharing and cache conflicts, traditional column-major or row-major array layouts incur high variability in memory system performance as matrix size varies. This paper investigates the use of recursive array layouts for improving the performance of parallel recursive matrix multiplication algorithms. We extend previous work by Frens and Wise on recursive matrix multiplication to examine several recursive array layouts and three recursive algorithms: standard matrix multiplication, and the more complex algorithms of Strassen and Winograd. We show that while recursive array layouts significantly outperform traditional layouts (reducing execution times by a factor of 1.2-2.5) for the standard algorithm, they offer little improvement for Strassen's and Winograd's algorithms;...
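The recursive array layouts discussed above can be made concrete with a Z-order (Morton) index function. The sketch below is illustrative only (Python, with a hypothetical function name; the papers' implementations are not in Python): it interleaves the bits of the row and column indices, which is what makes each quadrant of the matrix occupy a contiguous range of memory, matching the blocks a recursive multiplication visits.

```python
def morton_index(row, col, bits=16):
    """Interleave the bits of (row, col) into a Z-order (Morton) offset.

    Under this layout the four quadrants of a 2^bits x 2^bits matrix are
    stored contiguously, and so are their quadrants, recursively.
    """
    idx = 0
    for b in range(bits):
        idx |= ((col >> b) & 1) << (2 * b)      # even bit positions: column
        idx |= ((row >> b) & 1) << (2 * b + 1)  # odd bit positions: row
    return idx
```

For example, the four elements of the top-left 2x2 block map to offsets 0 through 3, so they sit together in memory regardless of the overall matrix size.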
Tuning Strassen's Matrix Multiplication for Memory Efficiency
 In Proceedings of SC98 (CD-ROM)
, 1998
Abstract

Cited by 38 (4 self)
Strassen's algorithm for matrix multiplication gains its lower arithmetic complexity at the expense of reduced locality of reference, which makes it challenging to implement the algorithm efficiently on a modern machine with a hierarchical memory system. We report on an implementation of this algorithm that uses several unconventional techniques to make the algorithm memory-friendly. First, the algorithm internally uses a nonstandard array layout known as Morton order that is based on a quadtree decomposition of the matrix. Second, we dynamically select the recursion truncation point to minimize padding without affecting the performance of the algorithm, which we can do by virtue of the cache behavior of the Morton ordering. Each technique is critical for performance, and their combination as done in our code multiplies their effectiveness. Performance comparisons of our implementation with that of competing implementations show that our implementation often outperforms th...
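The padding side of the dynamic-truncation idea can be sketched as follows. This is a simplified illustration under our own assumptions (the function names and the tie-breaking rule are ours, and the paper's actual selection also weighs cache behavior under the Morton layout, not padding alone): unrolling Strassen's recursion d levels on an n x n matrix requires padding n up to a multiple of 2^d, so one can pick the deepest d whose padding overhead is still minimal.

```python
def padded_size(n, depth):
    """n rounded up to a multiple of 2**depth: the size Strassen's
    recursion needs if it is unrolled `depth` levels on an n x n matrix."""
    block = 2 ** depth
    return -(-n // block) * block  # ceiling division, then scale back up

def choose_depth(n, max_depth=6):
    """Deepest recursion level whose padding overhead is minimal."""
    return max(range(max_depth + 1),
               key=lambda d: (-padded_size(n, d), d))
```

For instance, n = 96 is divisible by 32, so five levels of recursion need no padding at all, while n = 100 already pays padding beyond two levels.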
Recursive Array Layouts and Fast Matrix Multiplication
, 1999
Abstract

Cited by 31 (0 self)
The performance of both serial and parallel implementations of matrix multiplication is highly sensitive to memory system behavior. False sharing and cache conflicts cause traditional column-major or row-major array layouts to incur high variability in memory system performance as matrix size varies. This paper investigates the use of recursive array layouts to improve performance and reduce variability. Previous work on recursive matrix multiplication is extended to examine several recursive array layouts and three recursive algorithms: standard matrix multiplication, and the more complex algorithms of Strassen and Winograd. While recursive layouts significantly outperform traditional layouts (reducing execution times by a factor of 1.2-2.5) for the standard algorithm, they offer little improvement for Strassen's and Winograd's algorithms. For a purely sequential implementation, it is possible to reorder computation to conserve memory space and improve performance between ...
Implementation of Strassen's Algorithm for Matrix Multiplication
 In Proceedings of Supercomputing '96
, 1996
Abstract

Cited by 31 (0 self)
In this paper we report on the development of an efficient and portable implementation of Strassen's matrix multiplication algorithm. Our implementation is designed to be used in place of DGEMM, the Level 3 BLAS matrix multiplication routine. Efficient performance will be obtained for all matrix sizes and shapes and the additional memory needed for temporary variables has been minimized. Replacing DGEMM with our routine should provide a significant performance gain for large matrices while providing the same performance for small matrices. We measure performance of our code on the IBM RS/6000, CRAY Y-MP C90, and CRAY T3D single processor, and offer comparisons to other codes. Our performance data reconfirms that Strassen's algorithm is practical for realistic size matrices. The usefulness of our implementation is demonstrated by replacing DGEMM with our routine in a large application code.
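For reference, the seven-product recursion at the core of such implementations looks like the sketch below. This is plain Python for square power-of-two matrices only, with helper names of our own choosing; the DGEMM replacement described above additionally handles arbitrary sizes and shapes and minimizes temporary storage, which this sketch does not attempt.

```python
def mat_add(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def mat_sub(A, B):
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def strassen(A, B, cutoff=16):
    """Strassen's recursion for n x n matrices, n a power of two.

    Below `cutoff` the classical O(n^3) kernel is used, mirroring the
    recursion truncation that practical implementations rely on.
    """
    n = len(A)
    if n <= cutoff:
        return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
                for i in range(n)]
    h = n // 2
    a11 = [r[:h] for r in A[:h]]; a12 = [r[h:] for r in A[:h]]
    a21 = [r[:h] for r in A[h:]]; a22 = [r[h:] for r in A[h:]]
    b11 = [r[:h] for r in B[:h]]; b12 = [r[h:] for r in B[:h]]
    b21 = [r[:h] for r in B[h:]]; b22 = [r[h:] for r in B[h:]]
    # Seven recursive products instead of eight:
    m1 = strassen(mat_add(a11, a22), mat_add(b11, b22), cutoff)
    m2 = strassen(mat_add(a21, a22), b11, cutoff)
    m3 = strassen(a11, mat_sub(b12, b22), cutoff)
    m4 = strassen(a22, mat_sub(b21, b11), cutoff)
    m5 = strassen(mat_add(a11, a12), b22, cutoff)
    m6 = strassen(mat_sub(a21, a11), mat_add(b11, b12), cutoff)
    m7 = strassen(mat_sub(a12, a22), mat_add(b21, b22), cutoff)
    c11 = mat_add(mat_sub(mat_add(m1, m4), m5), m7)
    c12 = mat_add(m3, m5)
    c21 = mat_add(m2, m4)
    c22 = mat_add(mat_sub(mat_add(m1, m3), m2), m6)
    return ([c11[i] + c12[i] for i in range(h)] +
            [c21[i] + c22[i] for i in range(h)])
```

The extra additions and subtractions are where the temporary storage and locality costs discussed in these abstracts arise.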
Graph Expansion Analysis for Communication Costs of Fast Rectangular Matrix Multiplication
, 2012
Exploring the Tera MTA by Example
, 2000
Abstract

Cited by 2 (0 self)
This paper studies the design and implementation of the Multithreaded Architecture (MTA) supercomputer produced by the Tera Computer Company. It describes the most salient hardware and software features of this architecture, including lightweight synchronization and hardware/software cooperation. Special emphasis is placed on the available programming methodologies (futures, synchronized variables, and implicit parallelism) and their implementation on the Tera hardware. Programs from two very different problem domains, matrix multiplication and dynamic programming, are used to study Tera's development environment. Different versions of the same program are compared in order to evaluate the suitability of the various parallelization techniques.
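The future-based style mentioned above can be illustrated, by loose analogy only, with Python's concurrent.futures; the Tera compiler implements futures through hardware-supported threads and full/empty bits, not through this library, so the sketch below shows the programming model rather than the MTA mechanism.

```python
from concurrent.futures import ThreadPoolExecutor

def dot(row, col):
    return sum(a * b for a, b in zip(row, col))

def matmul_with_futures(A, B):
    """Each output entry is produced by a future; .result() is the join,
    mirroring the fork/join flavor of future-based parallelism."""
    Bt = list(zip(*B))  # columns of B
    with ThreadPoolExecutor() as pool:
        futures = [[pool.submit(dot, row, col) for col in Bt] for row in A]
        return [[f.result() for f in frow] for frow in futures]
```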
Architecture-efficient Strassen's Matrix Multiplication: A Case Study of Divide-and-Conquer Algorithms
 In International Linear Algebra Society (ILAS) Symposium on Algorithms for Control, Signals and Image Processing
, 1997
Abstract

Cited by 1 (1 self)
Many fast algorithms in arithmetic complexity have hierarchical or recursive structures that make efficient implementations on high performance computers with memory hierarchies nontrivial. In this paper we present our findings on efficient implementation of Strassen's algorithm [17] for the ubiquitous operation of matrix multiplication as a model for a class of recursive algorithms. In comparison to the conventional multiplication algorithm, Strassen's algorithm requires more storage space and exhibits poorer data locality. Although recent years have seen better representations and better implementations of the algorithm, a characterization of the implementation-level optimization issues, and hence automatic optimization strategies, is still lacking. We present our schemes for optimizing data locality and reducing storage space in an increased scope of computation sequences with arithmetic and numerical constraints. Moreover, our characterization of the optimization schemes is based on th...
On Improving the Memory Access Patterns During the Execution of Strassen's Matrix Multiplication Algorithm
Abstract

Cited by 1 (0 self)
Matrix multiplication is a basic computing operation. Though basic, it is also very expensive, with a straightforward technique of O(N^3) runtime complexity. More complex solutions such as Strassen's algorithm exist that reduce this complexity to O(N^2.81); the recursive nature of such algorithms places a large burden on memory systems due to temporary storage and the lack of locality in their access patterns. In this paper we propose a scheme for reordering the matrix entries stored in memory. This reordering provides two major benefits: a simple method to transform the recursive algorithm into an iterative one, and also a simple method for maintaining memory locality over the entire operation. These two features both provide an improvement in performance that grows as the problem size increases.
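One way to realize the recursion-to-iteration transformation the abstract describes is to visit blocks in increasing Morton (Z-order) code, since a flat loop over Morton codes traverses blocks in exactly the order the quadtree recursion would. The decoder below is an illustrative sketch under that assumption, not the authors' code: it recovers block coordinates from a Morton code by de-interleaving its bits.

```python
def demorton(idx):
    """Split a Morton (Z-order) code back into (row, col) block coordinates
    by de-interleaving its bits: even bits -> col, odd bits -> row."""
    row = col = b = 0
    while idx:
        col |= (idx & 1) << b        # even bit -> column
        row |= ((idx >> 1) & 1) << b  # odd bit -> row
        idx >>= 2
        b += 1
    return row, col
```

Iterating `for code in range(num_blocks): row, col = demorton(code)` then replaces the quadtree recursion with a single loop while keeping neighboring iterations on nearby blocks, which is the locality benefit the abstract claims.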