Facilitate SIMD-Code-Generation in the Polyhedral Model by Hardware-aware Automatic Code-Transformation
Abstract - Cited by 1 (1 self)
Although Single Instruction Multiple Data (SIMD) units have been available in general-purpose processors since the 1990s, state-of-the-art compilers are often still not capable of fully exploiting them, i.e., they may fail to achieve the best possible performance. We present a new hardware-aware and adaptive loop tiling approach that is based on polyhedral transformations and explicitly dedicated to improving auto-vectorization. It is an extension to the tiling algorithm implemented within the PluTo framework [4, 5]. In its default setting, PluTo uses static tile sizes and is already capable of enabling the use of SIMD units, but it is not primarily targeted at optimizing their use. We experimented with different tile sizes and found a strong relationship between their choice, cache size parameters, and performance. Based on this, we designed an adaptive procedure that specifically tiles vectorizable loops with dynamically calculated sizes. The blocking is automatically fitted to the amount of data read in loop iterations, the available SIMD units, and the cache sizes. The adaptive parts are built upon straightforward calculations that are experimentally verified and evaluated. Our results show significant improvements in the number of instructions vectorized, cache miss rates and, finally, running times.
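The tile-size fitting this abstract describes can be illustrated with a toy heuristic. The function name, the square-tile assumption, and the rounding rule below are illustrative guesses, not the paper's actual procedure:

```python
import math

def adaptive_tile_size(cache_bytes, elem_bytes, simd_width, arrays_touched):
    # Illustrative heuristic (not the paper's formula): pick a square tile
    # edge so that `arrays_touched` tiles fit in cache simultaneously, then
    # round down to a multiple of the SIMD vector length so the innermost
    # loop vectorizes without a scalar remainder.
    elems_per_array = cache_bytes // (elem_bytes * arrays_touched)
    edge = math.isqrt(elems_per_array)
    return max(simd_width, (edge // simd_width) * simd_width)

# Example: 32 KiB L1 cache, double precision, 4-wide SIMD, 3 live arrays
print(adaptive_tile_size(32 * 1024, 8, 4, 3))  # -> 36
```

The rounding to a SIMD-width multiple mirrors the paper's stated goal of fitting the blocking to both the cache sizes and the vector units.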
Inter-Tile Reuse Optimization Applied to Bandwidth Constrained Embedded Accelerators
Abstract
The adoption of High-Level Synthesis (HLS) tools has significantly reduced accelerator design time. A complex scaling problem that remains is the data transfer bottleneck. To scale up performance, accelerators require huge amounts of data and are often limited by interconnect resources. In addition, the energy spent by the accelerator is often dominated by the transfer of data, either in the form of memory references or data movement on the interconnect. In this paper we drastically reduce accelerator communication by exploring computation reordering and local buffer usage. Consequently, we present a new analytical methodology to optimize nested loops for inter-tile data reuse with loop transformations like interchange and tiling. We focus on embedded accelerators that can be used in a multi-accelerator System on Chip (SoC), so performance, area, and energy are key in this exploration. 1) On three common embedded applications in the image/video processing domain (demosaicing, block matching, object detection), we show that our methodology reduces data movement by up to 2.1x compared to the best case of intra-tile optimization. 2) We demonstrate that our small accelerators (1-3% of FPGA resources) can boost a simple MicroBlaze soft-core to the performance level of a high-end Intel i7 processor.
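The inter-tile reuse idea can be sketched with a toy traffic model: when a sliding-window kernel sweeps column tiles, keeping the halo columns of the previous tile in a local buffer avoids refetching them. The window width, tile size, and the counting model itself are assumptions for illustration, not the paper's methodology:

```python
def loads_per_row(width, tile, halo=1):
    # Count external column loads per image row for a (2*halo+1)-wide
    # window, sweeping column tiles of width `tile` (toy model).
    tile_starts = range(0, width, tile)
    # Without inter-tile reuse, every tile refetches its halo columns.
    no_reuse = sum(min(width, t + tile + halo) - max(0, t - halo)
                   for t in tile_starts)
    # With inter-tile reuse, each column is fetched exactly once; the
    # overlap with the next tile stays in the local buffer.
    with_reuse = width
    return no_reuse, with_reuse

print(loads_per_row(1920, 64))  # -> (1978, 1920)
```

Even in this tiny model the halo refetches add measurable traffic; for small tiles or wide windows the gap grows, which is the regime the paper's exploration targets.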
Locality Optimization For Data Parallel Programs
, 2013
Abstract
This thesis would not exist but for the hard work, dedication, and patience of my project partner in crime Alex Rubinsteyn. It’s been a long journey together, with much headache and boneheadedness and honest mistakes and successes along the way. This work is very much a joint effort, and it doesn’t quite seem fair that we aren’t graduating together. I would also like to thank my adviser Dennis for his patience, kindness, and encouragement. He took me in when I was a second-year student looking for an adviser, and stuck with me even though I turned out to be an easily-distracted amateur musician. I never cease to be amazed at the breadth of topics he has his intelligent hands in. My beautiful wife Galia also played a key role in making this happen. She’s always a source of brightness in my life, and wouldn’t let me give up even in the hardest moments. I would also like to thank my parents for letting me run off from my tiny Minnesota hometown and do all the crazy things I’ve decided to do. Their total lack of judgment of my actions so long as I am happy has served as wonderful encouragement to try the many beautiful things our world has to offer.
MODESTO: Data-centric Analytic Optimization of Complex Stencil Programs on Heterogeneous Architectures
Abstract
Code transformations, such as loop tiling and loop fusion, are of key importance for the efficient implementation of stencil computations. However, their direct application to a large code base is costly and severely impacts program maintainability. While recently introduced domain-specific languages facilitate the application of such transformations, they typically still require manual tuning or auto-tuning techniques to select the transformations that yield optimal performance. In this paper, we introduce MODESTO, a model-driven stencil optimization framework that, for a given stencil program, suggests transformations optimized for a target architecture. Initially, we review and categorize data locality transformations for stencil programs and introduce a stencil algebra that allows the expression and enumeration of different stencil program implementation variants. Combining this algebra with a compile-time performance model, we show how to automatically tune stencil programs. We use our framework to model the STELLA library and optimize kernels used by the COSMO atmospheric model on multi-core and hybrid CPU-GPU architectures. Compared to naive and expert-tuned variants, the automatically tuned kernels attain a 2.0–3.1x and a 1.0–1.8x speedup, respectively.
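The loop fusion trade-off such frameworks weigh, eliminating an intermediate array at the cost of recomputation, can be shown on a two-stage 1D 3-point stencil. This is a generic sketch, not code from MODESTO or STELLA:

```python
def two_stage(a):
    # Unfused: materialize the intermediate array b, then smooth again.
    n = len(a)
    b = [(a[i - 1] + a[i] + a[i + 1]) / 3 for i in range(1, n - 1)]
    return [(b[i - 1] + b[i] + b[i + 1]) / 3 for i in range(1, len(b) - 1)]

def two_stage_fused(a):
    # Fused: recompute the three needed stage-1 values per output point,
    # trading redundant arithmetic for the intermediate array's traffic.
    s1 = lambda i: (a[i - 1] + a[i] + a[i + 1]) / 3
    n = len(a)
    return [(s1(i - 1) + s1(i) + s1(i + 1)) / 3 for i in range(2, n - 2)]

data = [float(x * x % 7) for x in range(12)]
assert two_stage(data) == two_stage_fused(data)
```

Whether fusion pays off depends on the cost of the recomputed stage-1 points versus the saved memory traffic, which is exactly the kind of decision a compile-time performance model can automate.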
Oil and Water can mix! Experiences with integrating Polyhedral and AST-based Transformations
Abstract
The polyhedral model is an algebraic framework for affine program representations and transformations for enhancing locality and parallelism. Compared with traditional AST-based transformation frameworks, the polyhedral model can easily handle imperfectly nested loops and complex data dependences within and across loop nests in a unified framework. On the other hand, AST-based transformation frameworks for locality and parallelism have a long history that dates back to early vectorizing and parallelizing compilers. They can be used to efficiently perform a wide range of transformations including hierarchical parametric tiling, parallel reduction, scalar replacement and unroll-and-jam, and the implemented loop transformations are more compact (with smaller code size) than in polyhedral frameworks. While many members of the polyhedral and AST-based transformation camps see the two frameworks as a mutually exclusive either-or choice, our experience has been that both frameworks can be integrated in a synergistic manner. In this paper, we present our early experiences with integrating polyhedral and AST-based transformations. Our preliminary experiments demonstrate the benefits of the proposed combined approach relative to Pluto, a pure polyhedral framework for locality and parallelism optimizations.
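Unroll-and-jam, one of the AST-based transformations the abstract names, can be illustrated on a simple matrix-vector loop nest. This generic sketch is not code from the paper's framework:

```python
def matvec(A, x):
    # Baseline nested loop: y = A @ x.
    n, m = len(A), len(x)
    y = [0.0] * n
    for i in range(n):
        for j in range(m):
            y[i] += A[i][j] * x[j]
    return y

def matvec_unroll_jam(A, x):
    # Unroll the outer i-loop by 2 and jam the copies into one inner loop,
    # so each x[j] is loaded once and reused for two rows.
    n, m = len(A), len(x)
    y = [0.0] * n
    for i in range(0, n - 1, 2):
        for j in range(m):
            xj = x[j]
            y[i] += A[i][j] * xj
            y[i + 1] += A[i + 1][j] * xj
    if n % 2:  # remainder row when n is odd
        for j in range(m):
            y[n - 1] += A[n - 1][j] * x[j]
    return y
```

The transformed loop computes the same result while halving the loads of `x`, which is the register- and cache-reuse benefit AST-based frameworks obtain from this rewrite.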