## Fast additions on masked integers (2006)

Venue: | SIGPLAN Not |

Citations: | 4 - 3 self |

### BibTeX

@ARTICLE{Adams06fastadditions,
    author = {Michael D. Adams and David S. Wise},
    title = {Fast additions on masked integers},
    journal = {SIGPLAN Not},
    year = {2006},
    volume = {41},
    pages = {10--1145}
}

### Abstract

Suppose the bits of a computer word are partitioned into d disjoint sets, each of which is used to represent one of a d-tuple of cartesian indices into d-dimensional space. Then, regardless of the partition, simple group operations and comparisons can be implemented for each index on a conventional processor in a sequence of two or three register operations. These indexings allow any blocked algorithm from linear algebra to use some non-standard matrix orderings that increase locality and enhance their performance. The underlying implementations were designed for alternating bit positions to index Morton-ordered matrices, but they apply, as well, to any bit partitioning. A hybrid ordering of the elements of a matrix becomes possible, therefore, with row-/column-major ordering within cache-sized blocks and Morton ordering of those blocks themselves. So, one can enjoy the temporal locality of nested blocks, as well as compiler optimizations on row- or column-major ordering in base blocks.
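The "two or three register operations" claim can be illustrated with the well-known fill-the-gaps trick for dilated integers. The following is a minimal sketch, not necessarily the code from the paper: the function names, the 32-bit word width, and the use of Python are assumptions made here for illustration. It assumes both operands are normalized, meaning every bit outside the mask `m` is zero.

```python
WORD_MASK = 0xFFFFFFFF  # assumed word width w = 32


def masked_add(a: int, b: int, m: int) -> int:
    """Add two normalized masked integers that share the mask m (or, add, and)."""
    # Set every padding bit of one operand to 1 so that a carry leaving one
    # mask bit ripples across the gap into the next mask bit.
    filled = a | (~m & WORD_MASK)
    # Discard whatever the carries left behind in the padding positions.
    return (filled + b) & m


def masked_sub(a: int, b: int, m: int) -> int:
    """Subtract two normalized masked integers that share the mask m (sub, and)."""
    # Padding bits are 0 in both operands, so a borrow passes through them.
    return (a - b) & m


def masked_less(a: int, b: int) -> bool:
    """Normalized values under one mask keep their order as plain unsigned integers."""
    return a < b


if __name__ == "__main__":
    EVEN_BITS = 0x55555555  # alternating bit positions, as in Morton order
    # 3 is 0b101 and 1 is 0b001 when spread over the even bits.
    print(bin(masked_add(0b101, 0b001, EVEN_BITS)))    # 4 spread over even bits
    print(bin(masked_sub(0b10000, 0b001, EVEN_BITS)))  # 3 spread over even bits
```

Nothing in these functions depends on the mask being regular, which matches the abstract's point that the operations work for any bit partitioning.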

### Citations

8542 |
Introduction to Algorithms
- Cormen, Leiserson, et al.
- 1990
Citation Context ...rarchy [9, 13]. At run time this suffices for a single translation look-aside buffer (TLB) entry to span any block. Recursive programming on that nesting leads naturally to cache-oblivious algorithms [5]. Some of its desired locality is still available, however, simply by looping on cartesian indices into an array that is represented in Morton-order [6]. They would be represented as dilated integers.... |

741 |
A set of Level 3 Basic Linear Algebra Subprograms
- Dongarra, Croz, et al.
- 1990
Citation Context ...able above L1 cache by using recursion on Morton-ordered blocks. Chatterjee et al. created this hybrid ordering and used manufacturers' optimized BLAS3 libraries to seek high performance and locality [3, 4]. They did not, however, use the address arithmetic described here and also wasted computation converting their structure to and from a pure raster order. For w = 32, one can represent Y = .*.*.*.*.*.*... |

138 |
A Computer Oriented Geodetic Data Base and a New Technique in File Sequencing
- Morton
- 1966
Citation Context ... very fast, repeated register transfer. 2.2 Morton-order Matrices. Morton ordering for matrices has the advantage of representing nested blocks in adjacent memory at all levels of the memory hierarchy [9, 13]. At run time this suffices for a single translation look-aside buffer (TLB) entry to span any block. Recursive programming on that nesting leads naturally to cache-oblivious algorithms [5]. Some of i... |

73 |
Reduction of operator strength
- Allen, Cooke, et al.
- 1981
Citation Context ...t the implicit shifts of cartesian indices into their Y and X bit positions.) The anticipated products that would shift i to high-order bits are exactly those long associated with strength reduction [1]. That is, representing those products as integers held in high-order bits is exactly the same as the sums from the additions that have always been introduced by optimizing compilers [2]. Fixed interl... |

48 | Recursive array layout and fast parallel matrix multiplication
- Chatterjee, Lebeck, et al.
- 1999

26 | Ahnentafel indexing into Morton-ordered arrays, or matrix locality for free. In: Euro-Par 2000, LNCS 1900
- Wise
- 2000

24 |
Finding neighbors of equal size in linear quadtrees and octrees in constant time
- Schrack
- 1992
Citation Context ...d be represented as ..1...01₂ and a column index j = 17 as 10.001..₂, and both of them summed in one byte as 10100101₂. Schrack defines three kinds of padding at the dots, only one of which is used here [12]. If the padding within the value of a MaskedInteger is forced to zero bits (dots become 0), then the representation is normalized; 5 is normalized in Y = 00*000**₂ to 00100001₂. If the padding/dots we... |
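The worked example in this excerpt can be checked mechanically. The sketch below is illustrative only: the constant and function names are hypothetical, and the column mask is read off from the positions of the digits in the excerpt's column pattern (everything that is not a dot).

```python
# Row mask from the excerpt: Y = 00*000** (bits 5, 1 and 0 carry the row index).
Y_MASK = 0b00100011
# Column mask read off the column pattern 10.001.. (bits 7, 6, 4, 3 and 2).
X_MASK = 0b11011100

ROW_5 = 0b00100001   # row index 5 = 101 in binary, placed in the Y bits
COL_17 = 0b10000100  # column index 17 = 10001 in binary, placed in the X bits


def normalize(value: int, mask: int) -> int:
    """Force every padding bit (every dot) to zero."""
    return value & mask


def combine(row_masked: int, col_masked: int) -> int:
    """Sum a row and a column component into one offset."""
    # The masks are disjoint, so no carries occur and + behaves like |.
    return normalize(row_masked, Y_MASK) + normalize(col_masked, X_MASK)


if __name__ == "__main__":
    print(format(combine(ROW_5, COL_17), "08b"))
    # A denormalized row value whose padding bits are all 1 still normalizes to ROW_5.
    print(format(normalize(0b11111101, Y_MASK), "08b"))
```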

9 |
Converting to and from dilated integers
- Raman, Wise
Citation Context ...al optimizers understand their algebra (for unrolling and code motion). Elementary casting-conversion algorithms between the underlying unsigned-integer types T and MaskedIntegers are readily available [11]. Casts to and from type T are reserved to effect those bit-compressing and bit-spreading conversions; table lookup is recommended. But they can be simple: when an integer bo is a power of two, like an i... |
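For orientation, the bit-spreading and bit-compressing conversions mentioned in this excerpt can be written as a naive bit-by-bit loop. This is a minimal sketch for arbitrary masks, not the table-lookup method the excerpt recommends and not the algorithms of [11]; the names `spread` and `compress` are hypothetical.

```python
def spread(value: int, mask: int) -> int:
    """Deposit the low-order bits of value into the 1-positions of mask."""
    result = 0
    source_bit = 1
    while mask:
        lowest = mask & -mask      # isolate the lowest remaining mask bit
        if value & source_bit:
            result |= lowest
        source_bit <<= 1
        mask &= mask - 1           # clear that mask bit and move on
    return result


def compress(masked: int, mask: int) -> int:
    """Gather the bits of masked at the 1-positions of mask into a plain integer."""
    result = 0
    target_bit = 1
    while mask:
        lowest = mask & -mask
        if masked & lowest:
            result |= target_bit
        target_bit <<= 1
        mask &= mask - 1
    return result


if __name__ == "__main__":
    print(format(spread(5, 0b00100011), "08b"))
    print(format(spread(17, 0b11011100), "08b"))
    print(compress(0b10100101, 0b11011100))
```

The loop costs one iteration per mask bit, which is why a lookup table is attractive when conversions are frequent; inside a loop nest the indices can stay in masked form and avoid conversion altogether.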

6 |
The History of FORTRAN I, II, and III
- Backus
- 1978
Citation Context ...gth reduction [1]. That is, representing those products as integers held in high-order bits is exactly the same as the sums from the additions that have always been introduced by optimizing compilers [2]. Fixed interleaving of hardware cache-replacement strategies is known to make power-of-2 strides especially poor when blocks exceed L1 cache capacity. For blocks that fit entirely within L1 cache, h... |

3 | The Opie compiler from row-major source to Morton-ordered matrices
- Gabriel, Wise
- 2004
Citation Context ...n indices. A free choice of partitioning allows the physical representation of the matrix to be easily rearranged, introducing new locality into old blocked algorithms that enhances their performance [6]. Let $m = \sum_{k=0}^{w-1} m_k 2^k$ be interpreted as a constant mask of a $w$-bit computer word; a $k$th bit whose $m_k = 0$ is excluded, and a $k$th bit whose $m_k = 1$ is included in the mask. The class of integers repres... |

3 | Special feature: Epigrams on programming - Perlis - 1982 |