Efficient Utilization of SIMD Extensions
IEEE Proceedings Special Issue on Program Generation, Optimization, and Platform Adaptation, 2003
Abstract
Cited by 11 (6 self)
This paper targets automatic performance tuning of numerical kernels in the presence of multi-layered memory hierarchies and SIMD parallelism. The studied SIMD instruction set extensions include Intel’s SSE family, AMD’s 3DNow!, Motorola’s AltiVec, and IBM’s BlueGene/L SIMD instructions. FFTW, ATLAS, and SPIRAL demonstrate that near-optimal performance of numerical kernels across a variety of modern computers featuring deep memory hierarchies can be achieved only by means of automatic performance tuning. To maintain portability, these software packages generate and optimize ANSI C code and feed it into the target machine’s general-purpose C compiler. The scalar C code produced by performance-tuning systems poses a severe challenge for vectorizing compilers: its particular structure hampers automatic vectorization and thus inhibits satisfactory performance on processors featuring short vector extensions. This paper describes special-purpose compiler technology that supports automatic performance tuning on machines with vector instructions. The work described includes (i) symbolic vectorization of DSP transforms, (ii) straight-line code vectorization for numerical kernels, and (iii) compiler backends for straight-line code with vector instructions. Methods from all three areas were combined with FFTW, SPIRAL, and ATLAS to optimize for both the memory hierarchy and vector instructions. Experiments show that the presented methods lead to substantial speed-ups (factors of up to 1.8 for two-way and 3.3 for four-way vector extensions) over the best scalar C code generated by the original systems, and roughly match the performance of hand-tuned vendor libraries.
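Item (i), symbolic vectorization of DSP transforms, rests on the observation that some formula constructs map directly onto short-vector instructions. The following is a minimal Python sketch, assuming the tensor (Kronecker) product notation used in SPIRAL-style transform formulas, with ν the vector width. A construct of the form F_2 ⊗ I_ν reads and writes x only in contiguous chunks of ν elements, so it reduces to one vector add and one vector subtract with no shuffles. The helpers `vadd`/`vsub` and the function names are hypothetical stand-ins: plain list operations emulating SIMD instructions, not real intrinsics and not the paper's implementation.

```python
from typing import List

Vector = List[int]

def vadd(a: Vector, b: Vector) -> Vector:
    """Emulates one width-nu SIMD add (hypothetical stand-in, not an intrinsic)."""
    return [p + q for p, q in zip(a, b)]

def vsub(a: Vector, b: Vector) -> Vector:
    """Emulates one width-nu SIMD subtract (hypothetical stand-in)."""
    return [p - q for p, q in zip(a, b)]

def f2_kron_i_vectorized(x: Vector, nu: int) -> Vector:
    """y = (F_2 ⊗ I_nu) x using two contiguous loads, one add, one sub."""
    if len(x) != 2 * nu:
        raise ValueError("x must have length 2*nu")
    lo, hi = x[:nu], x[nu:]              # two contiguous vector loads
    return vadd(lo, hi) + vsub(lo, hi)   # two contiguous vector stores

def kron(a, b):
    """Dense Kronecker product of two matrices given as lists of rows."""
    return [[a[i][j] * b[k][l] for j in range(len(a[0])) for l in range(len(b[0]))]
            for i in range(len(a)) for k in range(len(b))]

def matvec(m, x):
    """Plain scalar matrix-vector product."""
    return [sum(mij * xj for mij, xj in zip(row, x)) for row in m]

def f2_kron_i_reference(x: Vector, nu: int) -> Vector:
    """Scalar reference: build F_2 ⊗ I_nu explicitly and multiply."""
    f2 = [[1, 1], [1, -1]]
    identity = [[1 if i == j else 0 for j in range(nu)] for i in range(nu)]
    return matvec(kron(f2, identity), x)
```

For example, `f2_kron_i_vectorized([1, 2, 3, 4], 2)` returns `[4, 6, -2, -2]`, the same result as the explicit matrix product. By contrast, a construct such as I_ν ⊗ F_2 combines adjacent elements and would need permutation (shuffle) instructions on real hardware.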
Near-Optimal Instruction Selection on DAGs
2008
Abstract
Cited by 2 (2 self)
Instruction selection is a key component of code generation. High-quality instruction selection is of particular importance in the embedded space, where complex instruction sets are common and code size is a prime concern. Although instruction selection on tree expressions is a well-understood and easily solved problem, instruction selection on directed acyclic graphs (DAGs) is NP-complete. In this paper we present NOLTIS, a near-optimal, linear-time instruction selection algorithm for DAG expressions. NOLTIS is easy to implement, fast, and effective, with a demonstrated average code-size improvement of 5.1% compared to the traditional tree decomposition and tiling approach.
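The abstract contrasts the easy tree case with the NP-complete DAG case. The following minimal Python sketch covers only the tree case and is not NOLTIS itself. It tiles an expression tree by dynamic programming, choosing at each node the cheapest matching pattern, with instruction count as a proxy for code size. The instruction set is hypothetical: `LI` (load immediate), `ADD`/`MUL` on registers, `ADDI` (add immediate), and a fused `MADD` (multiply-add), each costing one instruction. On a DAG this recursion is no longer exact, because a shared subexpression may be computed once into a register or duplicated into several tiles. The traditional baseline splits the DAG at shared nodes into trees and tiles each tree separately.

```python
from functools import lru_cache

# Expression trees as nested tuples (hypothetical encoding):
#   ("reg", name) | ("const", value) | ("add", left, right) | ("mul", left, right)
Expr = tuple

@lru_cache(maxsize=None)  # memoize so each distinct subtree is costed once
def best_cost(e: Expr) -> int:
    """Minimal instruction count to compute e into a register (hypothetical ISA)."""
    kind = e[0]
    if kind == "reg":
        return 0  # value already lives in a register
    if kind == "const":
        return 1  # LI rd, imm
    left, right = e[1], e[2]
    # Generic tile: ADD/MUL rd, rs, rt after computing both operands.
    candidates = [1 + best_cost(left) + best_cost(right)]
    if kind == "add":
        # ADDI rd, rs, imm folds a constant operand (add is commutative).
        if right[0] == "const":
            candidates.append(1 + best_cost(left))
        if left[0] == "const":
            candidates.append(1 + best_cost(right))
        # MADD rd, ra, rb, rc covers add(mul(a, b), c) in one instruction.
        for prod, other in ((left, right), (right, left)):
            if prod[0] == "mul":
                candidates.append(1 + best_cost(prod[1]) + best_cost(prod[2])
                                  + best_cost(other))
    return min(candidates)
```

For example, `best_cost(("add", ("mul", ("reg", "a"), ("reg", "b")), ("reg", "c")))` returns 1 because a single `MADD` tile covers the whole tree, whereas separate `MUL` and `ADD` tiles would cost 2.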

