Results 1-10 of 28
Terrain simplification simplified: A general framework for view-dependent out-of-core visualization
 IEEE Transactions on Visualization and Computer Graphics
"... ..."
Single Assignment C: efficient support for high-level array operations in a functional setting
, 2003
"... ..."
Global Static Indexing for Real-time Exploration of Very Large Regular Grids
, 2001
"... In this paper we introduce a new indexing scheme for progressive traversal and visualization of large regular grids. We demonstrate the potential of our approach by providing a tool that displays at interactive rates planar slices of scalar field data with very modest computing resources. We obtain ..."
Cited by 63 (9 self)
In this paper we introduce a new indexing scheme for progressive traversal and visualization of large regular grids. We demonstrate the potential of our approach by providing a tool that displays at interactive rates planar slices of scalar field data with very modest computing resources. We obtain unprecedented results both in terms of absolute performance and, more importantly, in terms of scalability. On a laptop computer we provide real-time interaction with a 2048³ grid (8 Giganodes) using only 20 MB of memory. On an SGI Onyx we slice interactively an 8192³ grid (0.5 teranodes) using only 60 MB of memory. The scheme relies simply on the determination of an appropriate reordering of the rectilinear grid data and a progressive construction of the output slice. The reordering minimizes the amount of I/O performed during the out-of-core computation. The progressive and asynchronous computation of the output provides flexible quality/speed tradeoffs and a time-critical and interruptible user interface.
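The reordering described above is, in spirit, a hierarchical Z-order (bit-interleaved) traversal of the grid. As a rough, hypothetical illustration of that kind of index in C (not the paper's actual scheme; the function name morton3d is an assumption), the three grid coordinates can be interleaved so that nodes that are close in (x, y, z) stay close in the 1-D file order:

    /* Hypothetical sketch: interleave the bits of (x, y, z) so that
     * spatially nearby grid nodes get nearby 1-D indices.  Loop-based
     * for clarity; production code would use shift/mask dilation. */
    #include <stdint.h>

    uint64_t morton3d(uint32_t x, uint32_t y, uint32_t z)
    {
        uint64_t index = 0;
        for (int b = 0; b < 21; b++) {       /* 21 bits per axis fit in 63 bits */
            index |= (uint64_t)((x >> b) & 1) << (3 * b);
            index |= (uint64_t)((y >> b) & 1) << (3 * b + 1);
            index |= (uint64_t)((z >> b) & 1) << (3 * b + 2);
        }
        return index;
    }

Laying a grid out by such an index keeps each octree-style block of nodes contiguous on disk, which is the property out-of-core slicers of this kind exploit to keep I/O small.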
Recursive Array Layouts and Fast Matrix Multiplication
, 1999
"... The performance of both serial and parallel implementations of matrix multiplication is highly sensitive to memory system behavior. False sharing and cache conflicts cause traditional columnmajor or rowmajor array layouts to incur high variability in memory system performance as matrix size var ..."
Cited by 37 (0 self)
The performance of both serial and parallel implementations of matrix multiplication is highly sensitive to memory system behavior. False sharing and cache conflicts cause traditional column-major or row-major array layouts to incur high variability in memory system performance as matrix size varies. This paper investigates the use of recursive array layouts to improve performance and reduce variability. Previous work on recursive matrix multiplication is extended to examine several recursive array layouts and three recursive algorithms: standard matrix multiplication, and the more complex algorithms of Strassen and Winograd. While recursive layouts significantly outperform traditional layouts (reducing execution times by a factor of 1.2-2.5) for the standard algorithm, they offer little improvement for Strassen's and Winograd's algorithms. For a purely sequential implementation, it is possible to reorder computation to conserve memory space and improve performance between ...
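To make the idea concrete, here is a minimal, hypothetical sketch (not the paper's code) of the standard recursive multiplication such layouts serve, under the assumption of a pure Morton/Z-order layout in which the four quadrants of an n x n matrix (n a power of two) occupy four contiguous quarters of the backing array, stored in the order 11, 12, 21, 22:

    /* Sketch: C += A * B for n x n matrices in an assumed Z-order layout.
     * Each quadrant of a matrix is a contiguous sub-array of h*h elements.
     * The caller zeroes C before the first call. */
    void mm_rec(int n, const double *A, const double *B, double *C)
    {
        if (n == 1) {                       /* base case: scalar multiply-add */
            C[0] += A[0] * B[0];
            return;
        }
        int h = n / 2, q = h * h;           /* elements per quadrant */
        const double *A11 = A, *A12 = A + q, *A21 = A + 2*q, *A22 = A + 3*q;
        const double *B11 = B, *B12 = B + q, *B21 = B + 2*q, *B22 = B + 3*q;
        double       *C11 = C, *C12 = C + q, *C21 = C + 2*q, *C22 = C + 3*q;

        mm_rec(h, A11, B11, C11);  mm_rec(h, A12, B21, C11);  /* C11 += A11*B11 + A12*B21 */
        mm_rec(h, A11, B12, C12);  mm_rec(h, A12, B22, C12);  /* C12 += A11*B12 + A12*B22 */
        mm_rec(h, A21, B11, C21);  mm_rec(h, A22, B21, C21);  /* C21 += A21*B11 + A22*B21 */
        mm_rec(h, A21, B12, C22);  mm_rec(h, A22, B22, C22);  /* C22 += A21*B12 + A22*B22 */
    }

In practice the recursion stops at a cache-sized base block handled by a tuned kernel, which is where the hybrid layouts discussed further down the list come in.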
Cache oblivious algorithms
 Algorithms for Memory Hierarchies, LNCS 2625
, 2003
"... Abstract. The cache oblivious model is a simple and elegant model to design algorithms that perform well in hierarchical memory models ubiquitous on current systems. This model was first formulated in [22] and has since been a topic of intense research. Analyzing and designing algorithms and data st ..."
Cited by 13 (0 self)
The cache oblivious model is a simple and elegant model for designing algorithms that perform well in the hierarchical memory models ubiquitous on current systems. This model was first formulated in [22] and has since been a topic of intense research. Analyzing and designing algorithms and data structures in this model involves not only an asymptotic analysis of the number of steps executed in terms of the input size, but also the movement of data optimally among the different levels of the memory hierarchy. This chapter is aimed as an introduction to the “ideal-cache” model of [22] and techniques used to design cache oblivious algorithms. The chapter also presents some experimental insights and results. Part of this work was done while the author was visiting MPI-Saarbrücken.
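A standard introductory example for this model (a sketch written for this listing, not taken from the chapter) is the cache-oblivious out-of-place transpose: recursing on the longer dimension produces tiles that eventually fit in every level of the hierarchy, yet the code never names a cache size:

    /* Sketch: B = A^T for an n x n row-major matrix, with no cache
     * parameter anywhere.  The recursion halves the longer dimension
     * until a small base tile remains; at some depth the working tile
     * fits each cache level. */
    #include <stddef.h>

    void transpose_rec(const double *A, double *B, int n,
                       int r0, int r1, int c0, int c1)
    {
        int dr = r1 - r0, dc = c1 - c0;
        if (dr <= 16 && dc <= 16) {                     /* base tile */
            for (int i = r0; i < r1; i++)
                for (int j = c0; j < c1; j++)
                    B[(size_t)j * n + i] = A[(size_t)i * n + j];
        } else if (dr >= dc) {                          /* split rows */
            int rm = r0 + dr / 2;
            transpose_rec(A, B, n, r0, rm, c0, c1);
            transpose_rec(A, B, n, rm, r1, c0, c1);
        } else {                                        /* split columns */
            int cm = c0 + dc / 2;
            transpose_rec(A, B, n, r0, r1, c0, cm);
            transpose_rec(A, B, n, r0, r1, cm, c1);
        }
    }
    /* Initial call: transpose_rec(A, B, n, 0, n, 0, n); */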
QR Factorization with Morton-Ordered Quadtree Matrices for Memory Reuse and Parallelism
 In Proc. 2003 ACM Symp. on Principles and Practice of Parallel Programming
, 2003
"... Quadtree matrices using Mortonorder storage provide natural blocking on every level of a memory hierarchy. Writing the natural recursive algorithms to take advantage of this blocking results in code that honors the memory hierarchy without the need for transforming the code. Furthermore, the divide ..."
Cited by 13 (1 self)
Quadtree matrices using Morton-order storage provide natural blocking on every level of a memory hierarchy. Writing the natural recursive algorithms to take advantage of this blocking results in code that honors the memory hierarchy without the need for transforming the code. Furthermore, the divide-and-conquer algorithm breaks problems down into independent computations. These independent computations can be dispatched in parallel for straightforward parallel processing. Proof of concept is given by an algorithm for QR factorization based on Givens rotations for quadtree matrices in Morton-order storage. The algorithms deliver positive results, competing with and even beating the LAPACK equivalent.
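The building block of such a factorization is the Givens rotation itself; a minimal sketch (illustrative only, not the paper's kernel, and using one common sign convention) that annihilates the second component of a 2-vector follows. Applying such rotations block-wise to quadtree submatrices yields the independent, dispatchable computations the abstract mentions.

    /* Sketch: compute c, s so that [c -s; s c] * (a, b)^T = (r, 0)^T.
     * hypot() already guards against intermediate overflow; robust
     * library kernels add further scaling and sign conventions. */
    #include <math.h>

    void givens(double a, double b, double *c, double *s)
    {
        double r = hypot(a, b);
        if (r == 0.0) { *c = 1.0; *s = 0.0; return; }   /* nothing to rotate */
        *c = a / r;
        *s = -b / r;
    }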
Seven at one stroke: Results from a cache-oblivious paradigm for scalable matrix algorithms
 In MSPC ’06: Proc. 2006 Wkshp. Memory System Performance and Correctness
, 2006
"... A blossoming paradigm for blockrecursive matrix algorithms is presented that, at once, attains excellent performance measured by • time, • TLB misses, • L1 misses, • L2 misses, • paging to disk, • scaling on distributed processors, and • portability to multiple platforms. It provides a philosophy a ..."
Cited by 7 (2 self)
A blossoming paradigm for block-recursive matrix algorithms is presented that, at once, attains excellent performance as measured by time, TLB misses, L1 misses, L2 misses, paging to disk, scaling on distributed processors, and portability to multiple platforms. It provides a philosophy and tools that allow the programmer to deal with the memory hierarchy invisibly, from L1 and L2 to TLB, paging, and interprocessor communication. Used together, they provide a cache-oblivious style of programming. Plots are presented to support these claims on an implementation of Cholesky factorization crafted directly from the paradigm in C with a few intrinsic calls. The results in this paper focus on low-level performance, including the new Morton-hybrid representation to take advantage of hardware and compiler optimizations. In particular, this code beats Intel's Math Kernel Library and matches AMD's Core Math Library, losing a bit on L1 misses while winning decisively on TLB misses.
Fast additions on masked integers
 SIGPLAN Notices
, 2006
"... Abstract: Suppose the bits of a computer word are partitioned into d disjoint sets, each of which is used to represent one of a dtuple of cartesian indices into ddimensional space. Then, regardless of the partition, simple group operations and comparisons can be implemented for each index on a con ..."
Cited by 4 (3 self)
Suppose the bits of a computer word are partitioned into d disjoint sets, each of which is used to represent one of a d-tuple of Cartesian indices into d-dimensional space. Then, regardless of the partition, simple group operations and comparisons can be implemented for each index on a conventional processor in a sequence of two or three register operations. These indexings allow any blocked algorithm from linear algebra to use nonstandard matrix orderings that increase locality and enhance performance. The underlying implementations were designed for alternating bit positions to index Morton-ordered matrices, but they apply, as well, to any bit partitioning. A hybrid ordering of the elements of a matrix therefore becomes possible, with row/column-major ordering within cache-sized blocks and Morton ordering of those blocks themselves. So one can enjoy the temporal locality of nested blocks, as well as compiler optimizations on row- or column-major ordering in base blocks.
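As a rough sketch of the two-or-three-instruction arithmetic being described (the masks below assume the 2-D Morton case on a 64-bit word; the same formulas work for any bit partition, and the helper names are assumptions, not the paper's), addition on a value whose significant bits sit only at the positions of a mask works by temporarily filling the gap bits with ones so that carries ripple across them:

    /* Sketch: group operations on "dilated" (masked) integers. */
    #include <stdint.h>

    #define MASK_X 0x5555555555555555ULL   /* even bits: x component (assumed) */
    #define MASK_Y 0xaaaaaaaaaaaaaaaaULL   /* odd bits:  y component (assumed) */

    /* a + b, both dilated under mask m: fill a's gap bits with 1s so
     * carries ripple across the gaps, then mask the gaps back off.  */
    static inline uint64_t dil_add(uint64_t a, uint64_t b, uint64_t m)
    {
        return ((a | ~m) + b) & m;
    }

    /* a - b, both dilated under mask m: borrows already ripple across
     * the zero gap bits, so only the final masking is needed.        */
    static inline uint64_t dil_sub(uint64_t a, uint64_t b, uint64_t m)
    {
        return (a - b) & m;
    }

    /* Example: step to the east neighbour of a Morton-encoded element
     * by incrementing only its x component (bit 0 plays the role of
     * "1" in the x sub-word). */
    static inline uint64_t morton_east(uint64_t z)
    {
        return (z & MASK_Y) | dil_add(z & MASK_X, 1, MASK_X);
    }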
An Efficient Semi-Hierarchical Array Layout
 In Proceedings of the Workshop on Interaction between Compilers and Computer Architectures
, 2001
"... For highlevel programming languages, linear array layout (e.g., column major and row major orders) have de facto been the sole form of mapping array elements to memory. The increasingly deep and complex memory hierarchies present in current computer systems expose several deficiencies of linear arr ..."
Cited by 4 (0 self)
For high-level programming languages, linear array layouts (e.g., column-major and row-major orders) have de facto been the sole form of mapping array elements to memory. The increasingly deep and complex memory hierarchies present in current computer systems expose several deficiencies of linear array layouts. One such deficiency is that linear array layouts strongly favor locality in one index dimension of multidimensional arrays. Secondly, the exact mapping of array elements to cache locations depends on the array's size, which effectively renders linear array layouts non-analyzable with respect to cache behavior. We present and evaluate an alternative, semi-hierarchical, array layout which differs from linear array layouts by being neutral with respect to locality in different index dimensions and by enabling accurate and precise analysis of cache behavior at compile-time. Simulation results indicate that the proposed layout may exhibit vastly improved TLB behavior, leading to clearly measurable improvements in execution time, despite a lack of suitable hardware support for address computations. Cache behavior is formalized in terms of conflict vectors, and it is shown how to compute such conflict vectors at compile-time.
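A minimal, hypothetical sketch of a layout in this general family (the tile size, the choice of Morton order for the tiles, and the function names are assumptions, not taken from the paper): elements inside a small tile are stored row-major, while the tiles themselves are arranged in Morton order, so locality is symmetric in both index dimensions and the element-to-cache-set mapping no longer depends on the array extent.

    /* Sketch: linear offset of element (i, j) in a hybrid layout with
     * TILE x TILE row-major tiles laid out in 2-D Morton order.
     * Assumes the matrix is padded so the tiles per side are a power
     * of two. */
    #include <stdint.h>

    enum { TILE = 32 };                        /* tile edge, an assumption */

    static uint64_t dilate2(uint32_t v)        /* insert a 0 bit after each bit */
    {
        uint64_t d = 0;
        for (int b = 0; b < 32; b++)
            d |= (uint64_t)((v >> b) & 1) << (2 * b);
        return d;
    }

    uint64_t hybrid_offset(uint32_t i, uint32_t j)
    {
        uint64_t tile = (dilate2(i / TILE) << 1) | dilate2(j / TILE); /* Morton tile id  */
        return tile * (uint64_t)(TILE * TILE)                         /* start of tile   */
             + (uint64_t)(i % TILE) * TILE + (j % TILE);              /* row-major inside */
    }

The tile-level Morton bits are what make the cache mapping independent of the array's size, while the row-major interior keeps the compiler's usual loop optimizations applicable, matching the trade-off the abstract describes.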