Results 1 -
5 of
5
Tuning Strassen's Matrix Multiplication for Memory Efficiency
- IN PROCEEDINGS OF SC98 (CD-ROM
, 1998
"... Strassen's algorithm for matrix multiplication gains its lower arithmetic complexity at the expense of reduced locality of reference, which makes it challenging to implement the algorithm efficiently on a modern machine with a hierarchical memory system. We report on an implementation of this alg ..."
Abstract
-
Cited by 34 (4 self)
- Add to MetaCart
Strassen's algorithm for matrix multiplication gains its lower arithmetic complexity at the expense of reduced locality of reference, which makes it challenging to implement the algorithm efficiently on a modern machine with a hierarchical memory system. We report on an implementation of this algorithm that uses several unconventional techniques to make the algorithm memory-friendly. First, the algorithm internally uses a non-standard array layout known as Morton order that is based on a quad-tree decomposition of the matrix. Second, we dynamically select the recursion truncation point to minimize padding without affecting the performance of the algorithm, which we can do by virtue of the cache behavior of the Morton ordering. Each technique is critical for performance, and their combination as done in our code multiplies their effectiveness. Performance comparisons of our implementation with that of competing implementations show that our implementation often outperforms th...
Research Demonstration of a Hardware Reference-Counting Heap
, 1997
"... A hardware self-managing heap memory (RCM) for languages like LISP, SMALLTALK, and JAVA has been designed, built, tested and benchmarked. On every pointer write from the processor, reference-counting transactions are performed in real time within this memory, and garbage cells are reused without pro ..."
Abstract
-
Cited by 7 (4 self)
- Add to MetaCart
A hardware self-managing heap memory (RCM) for languages like LISP, SMALLTALK, and JAVA has been designed, built, tested and benchmarked. On every pointer write from the processor, reference-counting transactions are performed in real time within this memory, and garbage cells are reused without processor cycles. A processor allocates new nodes simply by reading from a distinguished location in its address space. The memory hardware also incorporates support for off-line, multiprocessing, mark-sweep garbage collection. Performance statistics are presented from a partial implementation of SCHEME over five different memory models and two garbage collection strategies, from main memory (no access to RCM) to a fully operational RCM installed on an external bus. The performance of the RCM memory is more than competitive with main memory.
Matrix Algorithms using Quadtrees Invited Talk, ATABLE-92
- in Proc. ATABLE-92
, 1992
"... Many scheduling and synchronization problems for large-scale multiprocessing can be overcome using functional (or applicative) programming. With this observation, it is strange that so much attention within the functional programming community has focused on the "aggregate update problem" [10]: esse ..."
Abstract
-
Cited by 1 (0 self)
- Add to MetaCart
Many scheduling and synchronization problems for large-scale multiprocessing can be overcome using functional (or applicative) programming. With this observation, it is strange that so much attention within the functional programming community has focused on the "aggregate update problem" [10]: essentially how to implement FORTRAN arrays. This situation is strange because in-place updating of aggregates belongs more to uniprocessing than to mathematics. Several years ago functional style drew me to treatment of d-dimen- sional arrays as 2 d --ary trees; in particular, matrices become quaternary trees or quadtrees. This convention yields efficient recopying-cum-update of any array; recursive, algebraic decomposition of conventional arithmetic algorithms; and uniform representations and algorithms for both dense and sparse matrices. For instance, any nonsingular subtree is a candidate as the pivot block for Gaussian elimination; the restriction actually helps identification of pivot b...
Research Demonstration of a Hardware Reference-Counting Heap
, 1994
"... A hardware self-managing heap memory (RCM) for languages like LISP, SMALLTALK, and JAVA has been designed, built, tested and benchmarked. On every pointer write from the processor, reference-counting transactions are performed in real time within this memory, and garbage cells are reused without pro ..."
Abstract
- Add to MetaCart
A hardware self-managing heap memory (RCM) for languages like LISP, SMALLTALK, and JAVA has been designed, built, tested and benchmarked. On every pointer write from the processor, reference-counting transactions are performed in real time within this memory, and garbage cells are reused without processor cycles. A processor allocates new nodes simply by reading from a distinguished location in its address space. The memory hardware also incorporates support for off-line, multiprocessing, mark-sweep garbage collection. Performance statistics are presented from a partial implementation of SCHEME over five different memory models and two garbage collection strategies, from main memory (no access to RCM) to a fully operational RCM installed on an external bus. The performance of the RCM memory is more than competitive with main memory. CR categories and Subject Descriptors: E.2 [Data Storage Representations]: Linked representations; D.4.2 [Storage Management]: Allocation/Deallocation st...

