Tuning Strassen's Matrix Multiplication for Memory Efficiency
In Proceedings of SC98 (CD-ROM), 1998
Cited by 40 (4 self)
Strassen's algorithm for matrix multiplication gains its lower arithmetic complexity at the expense of reduced locality of reference, which makes it challenging to implement the algorithm efficiently on a modern machine with a hierarchical memory system. We report on an implementation of this algorithm that uses several unconventional techniques to make the algorithm memory-friendly. First, the algorithm internally uses a nonstandard array layout known as Morton order that is based on a quadtree decomposition of the matrix. Second, we dynamically select the recursion truncation point to minimize padding without affecting the performance of the algorithm, which we can do by virtue of the cache behavior of the Morton ordering. Each technique is critical for performance, and their combination as done in our code multiplies their effectiveness. Performance comparisons of our implementation with that of competing implementations show that our implementation often outperforms th...
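The Morton-order layout mentioned in this abstract can be illustrated with a minimal sketch. It assumes a square matrix whose order is a power of two and a northwest, northeast, southwest, southeast quadrant order; the function names `morton_index` and `to_morton` are hypothetical and are not taken from the paper's code.

```python
def morton_index(row: int, col: int, bits: int = 16) -> int:
    """Interleave the bits of (row, col) into a single Z-order offset."""
    index = 0
    for b in range(bits):
        index |= ((col >> b) & 1) << (2 * b)       # column bits land on even positions
        index |= ((row >> b) & 1) << (2 * b + 1)   # row bits land on odd positions
    return index


def to_morton(matrix):
    """Copy a square 2^k x 2^k list-of-lists into a flat Morton-ordered list."""
    n = len(matrix)
    flat = [None] * (n * n)
    for r in range(n):
        for c in range(n):
            flat[morton_index(r, c)] = matrix[r][c]
    return flat
```

With this layout every quadrant of the matrix, at every level of the quadtree, occupies one contiguous run of the flat list, which is the property a recursive algorithm can exploit for locality.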
Research Demonstration of a Hardware Reference-Counting Heap
1997
Cited by 7 (4 self)
A hardware self-managing heap memory (RCM) for languages like LISP, SMALLTALK, and JAVA has been designed, built, tested and benchmarked. On every pointer write from the processor, reference-counting transactions are performed in real time within this memory, and garbage cells are reused without processor cycles. A processor allocates new nodes simply by reading from a distinguished location in its address space. The memory hardware also incorporates support for off-line, multiprocessing, mark-sweep garbage collection. Performance statistics are presented from a partial implementation of SCHEME over five different memory models and two garbage collection strategies, from main memory (no access to RCM) to a fully operational RCM installed on an external bus. The performance of the RCM memory is more than competitive with main memory.
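The reference-counting behaviour described in this abstract can be modelled in software. The sketch below is only an illustration under stated assumptions: cells have two pointer fields, `allocate` stands in for the read from a distinguished address, and the class and method names (`RefCountHeap`, `write`, `release`) are hypothetical. It says nothing about how the hardware itself is organized.

```python
class RefCountHeap:
    """Software model of a heap that adjusts reference counts on every pointer write."""

    def __init__(self, size: int):
        self.fields = [[None, None] for _ in range(size)]  # two pointer fields per cell
        self.count = [0] * size                            # reference count per cell
        self.free = list(range(size))                      # indices of reusable cells

    def allocate(self) -> int:
        """Hand out a free cell; the caller is assumed to hold one reference to it."""
        if not self.free:
            raise MemoryError("heap exhausted")
        cell = self.free.pop()
        self.count[cell] = 1
        return cell

    def write(self, cell: int, field: int, target):
        """Store a pointer (or None) and update the counts of the old and new targets."""
        old = self.fields[cell][field]
        if target is not None:
            self.count[target] += 1   # increment first so rewriting the same pointer is safe
        self.fields[cell][field] = target
        if old is not None:
            self.release(old)

    def release(self, cell: int):
        """Drop one reference; reclaim the cell and its children once a count reaches zero."""
        stack = [cell]
        while stack:
            c = stack.pop()
            self.count[c] -= 1
            if self.count[c] == 0:
                for f in (0, 1):
                    child = self.fields[c][f]
                    self.fields[c][f] = None
                    if child is not None:
                        stack.append(child)
                self.free.append(c)
```

Plain reference counting of this kind cannot reclaim cyclic structures, which is one reason a mark-sweep collector is useful as a complement.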
Matrix Algorithms using Quadtrees
In Proc. ATABLE-92, 1992
Cited by 5 (0 self)
Many scheduling and synchronization problems for large-scale multiprocessing can be overcome using functional (or applicative) programming. With this observation, it is strange that so much attention within the functional programming community has focused on the "aggregate update problem" [10]: essentially how to implement FORTRAN arrays. This situation is strange because in-place updating of aggregates belongs more to uniprocessing than to mathematics. Several years ago functional style drew me to treatment of d-dimensional arrays as 2^d-ary trees; in particular, matrices become quaternary trees or quadtrees. This convention yields efficient recopying-cum-update of any array; recursive, algebraic decomposition of conventional arithmetic algorithms; and uniform representations and algorithms for both dense and sparse matrices. For instance, any nonsingular subtree is a candidate as the pivot block for Gaussian elimination; the restriction actually helps identification of pivot b...
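The quadtree view of matrices in this abstract can be sketched as follows. This is an illustrative sketch, not the author's code: it assumes square matrices whose order is a power of two, uses a tuple `(nw, ne, sw, se)` for internal nodes, and represents any all-zero block by the single marker `ZERO`. All names are hypothetical.

```python
ZERO = None  # one shared marker stands for an all-zero block of any size


def prune(quads):
    """Collapse a node whose four quadrants are all zero."""
    return ZERO if all(q is ZERO for q in quads) else quads


def from_dense(rows):
    """Build a quadtree from a square list-of-lists whose order is a power of two."""
    n = len(rows)
    if n == 1:
        return rows[0][0] if rows[0][0] != 0 else ZERO
    h = n // 2
    return prune((
        from_dense([r[:h] for r in rows[:h]]),  # northwest
        from_dense([r[h:] for r in rows[:h]]),  # northeast
        from_dense([r[:h] for r in rows[h:]]),  # southwest
        from_dense([r[h:] for r in rows[h:]]),  # southeast
    ))


def to_dense(tree, n):
    """Expand a quadtree of order n back into a list-of-lists."""
    if tree is ZERO:
        return [[0] * n for _ in range(n)]
    if n == 1:
        return [[tree]]
    h = n // 2
    nw, ne, sw, se = (to_dense(q, h) for q in tree)
    return [l + r for l, r in zip(nw, ne)] + [l + r for l, r in zip(sw, se)]


def add(a, b):
    """Sum of two quadtrees of the same order; zero operands cost nothing."""
    if a is ZERO:
        return b
    if b is ZERO:
        return a
    if not isinstance(a, tuple):
        s = a + b
        return s if s != 0 else ZERO
    return prune(tuple(add(x, y) for x, y in zip(a, b)))


def mul(a, b):
    """Product of two quadtrees of the same order by recursive block decomposition."""
    if a is ZERO or b is ZERO:
        return ZERO  # an entire subtree of work is skipped
    if not isinstance(a, tuple):
        return a * b
    a11, a12, a21, a22 = a
    b11, b12, b21, b22 = b
    return prune((
        add(mul(a11, b11), mul(a12, b21)),
        add(mul(a11, b12), mul(a12, b22)),
        add(mul(a21, b11), mul(a22, b21)),
        add(mul(a21, b12), mul(a22, b22)),
    ))
```

The same functions serve dense and sparse inputs: a sparse matrix simply has many `ZERO` subtrees, and both `add` and `mul` return early when they meet one.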
An Architecture for Parallel Symbolic Processing Based on Suspending Construction
1995
Undulant-Block Elimination and Integer-Preserving Matrix Inversion
1995
© 1994, 1995 by the author. This work has been accepted for publication by Science
Research Demonstration of a Hardware
1997
A hardware self-managing heap memory (RCM) for languages like LISP, SMALLTALK, and JAVA has been designed, built, tested and benchmarked. On every pointer write from the processor, reference-counting transactions are performed in real time within this memory, and garbage cells are reused without processor cycles. A processor allocates new nodes simply by reading from a distinguished location in its address space. The memory hardware also incorporates support for off-line, multiprocessing, mark-sweep garbage collection. Performance statistics are presented from a partial implementation of SCHEME over five different memory models and two garbage collection strategies, from main memory (no access to RCM) to a fully operational RCM installed on an external bus. The performance of the RCM memory is more than competitive with main memory.
Efficient Parallel Solutions of Indexed Recurrences with Linear Combinations
1997
We consider a certain generalization of the well-known second-order linear recurrences $X_i = a_i \cdot X_{i-1} + b_i \cdot X_{i-2}$, $i = 1, \ldots, n$, to indexed recurrences with linear combinations $X_{g(i)} = a_i \cdot X_{f(i)} + b_i \cdot X_{h(i)}$, where $g(i)$, $f(i)$, $h(i)$ are arbitrary functions from $[1, \ldots, n]$ to $[1, \ldots, m]$. The problem is to find an efficient parallel algorithm that can compete with the sequential execution of the loop "for $i = 1, \ldots, n$ do $X[g(i)] = a_i \cdot X[f(i)] + b_i \cdot X[h(i)]$", which solves the above recurrence generalization. Such an algorithm (one that uses only $O(n)$ work) can be used for automatic parallelization of sequential loops, which in many practical cases fit the form of the above generalized recurrence. A natural solution is to transform the above sequential loop into a set of matrix multiplications and use the associative property of matrix multiplication to compute the result in $\log n$ parallel steps. We show that unlike the ca...
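Two pieces of the abstract above lend themselves to a short sketch: the sequential loop for the indexed recurrence, and the matrix formulation of the ordinary second-order recurrence. The code below is a hedged illustration only. It uses 0-based indices, the function names are hypothetical, and the pairwise `tree_product` merely shows how associativity permits about log n rounds of independent multiplications for the ordinary case. It is not the paper's algorithm for the indexed case.

```python
def indexed_recurrence(x, a, b, f, g, h):
    """Sequential reference loop: X[g(i)] = a[i]*X[f(i)] + b[i]*X[h(i)], i = 0..n-1."""
    x = list(x)
    for i in range(len(a)):
        x[g(i)] = a[i] * x[f(i)] + b[i] * x[h(i)]
    return x


def matmul2(p, q):
    """Product of two 2x2 matrices given as nested lists."""
    return [
        [p[0][0] * q[0][0] + p[0][1] * q[1][0], p[0][0] * q[0][1] + p[0][1] * q[1][1]],
        [p[1][0] * q[0][0] + p[1][1] * q[1][0], p[1][0] * q[0][1] + p[1][1] * q[1][1]],
    ]


def tree_product(ms):
    """Multiply an ordered list of 2x2 matrices pairwise, keeping their order.

    The products inside one round are independent, so each round could run in
    parallel; the number of rounds grows like log2 of the list length.
    """
    while len(ms) > 1:
        paired = [matmul2(ms[k], ms[k + 1]) for k in range(0, len(ms) - 1, 2)]
        if len(ms) % 2:
            paired.append(ms[-1])  # odd element carries over to the next round
        ms = paired
    return ms[0]


def second_order(x0, x_prev, a, b):
    """X_n for X_i = a_i*X_{i-1} + b_i*X_{i-2}, given X_0 = x0 and X_{-1} = x_prev."""
    # (X_i, X_{i-1}) = M_i (X_{i-1}, X_{i-2}) with M_i = [[a_i, b_i], [1, 0]]
    ms = [[[ai, bi], [1, 0]] for ai, bi in zip(a, b)]
    ms.reverse()  # the overall product is M_n ... M_1
    m = tree_product(ms)
    return m[0][0] * x0 + m[0][1] * x_prev
```

Choosing `g(i) = i + 2`, `f(i) = i + 1`, `h(i) = i` in `indexed_recurrence` reproduces the ordinary recurrence, which gives a simple way to check the matrix form against the loop.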
iii
The Office of Graduate Studies has verified and approved the above-named committee members. ACKNOWLEDGEMENTS: This thesis owes its existence to my major professor Dr. Robert van Engelen, who showed faith in me and gave me a great opportunity to do research under him. It is because of his guidance and endurance that I was able to bring this thesis to the shape it is currently in. I would also like to thank Dr. Lois Hawkes and Dr. Xin Yuan for serving on my graduate committee and providing me valuable guidance in my academic coursework. I feel indebted to my friend and senior Prasad Kulkarni. It wouldn't have been possible to come to the U.S.A. for graduate studies and complete the term of studies without his guidance, support and encouragement. My parent and sister have also been a great source of strength and support to me. I am grateful to them for their perseverance, without which I couldn't have imagined what life would have been for me.
Undulant-Block Elimination and Integer-Preserving Matrix Inversion
1995
A new formulation for $LU$ decomposition allows efficient representation of intermediate matrices while eliminating blocks of various sizes, i.e., during "undulant-block" elimination. Its efficiency arises from its design for block encapsulization, implicit in data structures that are convenient both for process scheduling and for memory management. Row/column permutations that can destroy such encapsulizations are deferred. Its algorithms, expressed naturally as functional programs, are well suited to parallel and distributed processing. A given matrix $A$ is decomposed into two matrices (in the space of just one), plus two permutations. The permutations, $P$ and $Q$, are the row/column rearrangements usual to complete pivoting. The principal results are $L$ and $U'$, where $L$ is properly lower quasi-triangular; $U'$ is upper quasi-triangular with its quasi-diagonal being the inverse of that of $U$ from the usual factorization ($PAQ = (I-L)U$), and its proper upper portion identical to $U$. The matrix result is $L+U'$. Algorithms for solving linear systems and matrix inversion follow directly. An example of a motivating data structure, the quadtree representation for matrices, is reviewed. Candidate pivots for Gaussian elimination under that structure are the subtrees, both constraining and assisting the pivot search, as well as decomposing to independent block/tree operations. The elementary algorithms are provided, coded in Haskell. Finally, an integer-preserving version is presented, replacing Bareiss's algorithm with a parallel equivalent. The decomposition of an integer matrix $A$ to integer matrices $\bar L$, $\bar U'$, and $d = \det A$ follows $L+U'$ decomposition, but the follow-on algorithm to compute $dA^{-1}$ is complicated by the requirement to maintain minimal denominators at every step and to avoid divisions, restricting them to necessarily exact ones.
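For orientation, the classical sequential Bareiss elimination that the abstract says is replaced by a parallel equivalent can be sketched as below. This is a textbook fraction-free determinant with simple row swaps, assuming a nonempty square integer matrix. It is not the paper's undulant-block or parallel formulation, and the name `bareiss_det` is hypothetical.

```python
def bareiss_det(matrix):
    """Determinant of a square integer matrix by fraction-free (Bareiss) elimination."""
    m = [list(row) for row in matrix]  # work on a copy
    n = len(m)
    sign = 1
    prev = 1  # pivot of the previous step; every division by it below is exact
    for k in range(n - 1):
        if m[k][k] == 0:
            # look for a lower row with a nonzero entry in column k and swap it up
            swap = next((r for r in range(k + 1, n) if m[r][k] != 0), None)
            if swap is None:
                return 0  # the whole column is zero, so the matrix is singular
            m[k], m[swap] = m[swap], m[k]
            sign = -sign
        for i in range(k + 1, n):
            for j in range(k + 1, n):
                # 2x2 minor divided by the previous pivot; the result stays an integer
                m[i][j] = (m[i][j] * m[k][k] - m[i][k] * m[k][j]) // prev
        prev = m[k][k]
    return sign * m[n - 1][n - 1]
```

All intermediate entries remain integers, and the only divisions are the exact ones by the previous pivot, which is the property the abstract refers to when it restricts divisions to necessarily exact ones.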