Results 1 -
4 of
4
Register allocation for programs in ssa-form
- In Compiler Construction 2006, volume 3923 of LNCS
, 2006
"... In this technical report, we present an architecture for register allocation on the SSA-form. We show, how the properties of SSA-form programs and their interference graphs can be exploited to develop new methods for spilling, coloring and coalescing. We present heuristic and optimal solution method ..."
Abstract
-
Cited by 20 (3 self)
- Add to MetaCart
In this technical report, we present an architecture for register allocation on the SSA-form. We show, how the properties of SSA-form programs and their interference graphs can be exploited to develop new methods for spilling, coloring and coalescing. We present heuristic and optimal solution methods for these three subtasks. 1
Efficient Utilization of SIMD Extensions
- IEEE PROCEEDINGS SPECIAL ISSUE ON PROGRAM GENERATION, OPTIMIZATION, AND PLATFORM ADAPTATION
, 2003
"... This paper targets automatic performance tuning of numerical kernels in the presence of multi-layered memory hierarchies and SIMD parallelism. The studied SIMD instruction set extensions include Intel’s SSE family, AMD’s 3DNow!, Motorola’s AltiVec, and IBM’s BlueGene/L SIMD instructions. FFTW, ATLA ..."
Abstract
-
Cited by 11 (6 self)
- Add to MetaCart
This paper targets automatic performance tuning of numerical kernels in the presence of multi-layered memory hierarchies and SIMD parallelism. The studied SIMD instruction set extensions include Intel’s SSE family, AMD’s 3DNow!, Motorola’s AltiVec, and IBM’s BlueGene/L SIMD instructions. FFTW, ATLAS, and SPIRAL demonstrate that near-optimal performance of numerical kernels across a variety of modern computers featuring deep memory hierarchies can be achieved only by means of automatic performance tuning. These software packages generate and optimize ANSI C code and feed it into the target machine’s general purpose C compiler to maintain portability. The scalar C code produced by performance tuning systems poses a severe challenge for vectorizing compilers. The particular code structure hampers automatic vectorization and thus inhibits satisfactory performance on processors featuring short vector extensions. This paper describes special purpose compiler technology that supports automatic performance tuning on machines with vector instructions. The work described includes (i) symbolic vectorization of DSP transforms, (ii) straight-line code vectorization for numerical kernels, and (iii) compiler backends for straight-line code with vector instructions. Methods from all three areas were combined with FFTW, SPIRAL, and ATLAS to optimize both for memory hierarchy and vector instructions. Experiments show that the presented methods lead to substantial speed-ups (up to 1.8 for two-way and 3.3 for four-way vector extensions) over the best scalar C codes generated by the original systems as well as roughly matching the performance of hand-tuned vendor libraries.
FFT Compiler Techniques
- In Compiler Construction: 13th International Conference, CC 2004, Held as Part of the Joint European Conferences on Theory and Practice of Software, ETAPS 2004
"... Abstract. This paper presents compiler technology that targets general purpose microprocessors augmented with SIMD execution units for exploiting data level parallelism. Numerical applications are accelerated by automatically vectorizing blocks of straight line code for processors featuring two-way ..."
Abstract
-
Cited by 1 (0 self)
- Add to MetaCart
Abstract. This paper presents compiler technology that targets general purpose microprocessors augmented with SIMD execution units for exploiting data level parallelism. Numerical applications are accelerated by automatically vectorizing blocks of straight line code for processors featuring two-way short vector SIMD extensions like Intel’s SSE 2 on Pentium 4, SSE 3 on Pentium 5, AMD’s 3DNow! , and IBM’s SIMD operations implemented on the new processors of the BlueGene/L supercomputer. The paper introduces a special compiler backend which is able (i) to exploit particular properties of FFT code, (ii) to generate optimized address computation, and (iii) to perform specialized register allocation and instruction scheduling. Experiments show that the presented automatic SIMD vectorization can achieve performance that is comparable to the hand optimized code for key benchmarks. The newly developed methods have been integrated into the codelet generator of Fftw and successfully vectorized complicated code like real-to-halfcomplex non-power of two FFT kernels. The floating-point performance of Fftw’s scalar version has been more than doubled, resulting in the fastest FFT implementation to date. 1
An Experimental Comparison of
"... Cache-oblivious algorithms have been advanced as a way of circumventing some of the difficulties of optimizing applications to take advantage of the memory hierarchy of modern microprocessors. These algorithms are based on the divide-and-conquer paradigm – each division step creates sub-problems of ..."
Abstract
- Add to MetaCart
Cache-oblivious algorithms have been advanced as a way of circumventing some of the difficulties of optimizing applications to take advantage of the memory hierarchy of modern microprocessors. These algorithms are based on the divide-and-conquer paradigm – each division step creates sub-problems of smaller size, and when the working set of a sub-problem fits in some level of the memory hierarchy, the computations in that sub-problem can be executed without suffering capacity misses at that level. In this way, divide-and-conquer algorithms adapt automatically to all levels of the memory hierarchy; in fact, for problems like matrix multiplication, matrix transpose, and FFT, these recursive algorithms are optimal to within constant factors for some theoretical models of the memory hierarchy. An important question is the following: how well do carefully tuned cache-oblivious programs perform compared to carefully tuned cache-conscious programs for the same problem? Is there a price for obliviousness, and if so, how much performance do we lose? Somewhat surprisingly, there are few studies in the literature that have addressed this question. This paper reports the results of such a study in the domain of dense linear algebra. Our main finding is that in this domain, even highly optimized cache-oblivious programs perform significantly worse than corresponding cache-conscious programs. We provide insights into why this is so, and suggest research directions for making cache-oblivious algorithms more competitive with cache-conscious algorithms. 1.

