Sponge: Portable stream programming on graphics engines. In ASPLOS’11, 2011.
"... Graphics processing units (GPUs) provide a low cost platform for accelerating high performance computations. The introduction of new programming languages, such as CUDA and OpenCL, makes GPU programming attractive to a wide variety of programmers. However, programming GPUs is still a cumbersome task ..."
Abstract
-
Cited by 36 (7 self)
- Add to MetaCart
(Show Context)
Graphics processing units (GPUs) provide a low-cost platform for accelerating high performance computations. The introduction of new programming languages, such as CUDA and OpenCL, makes GPU programming attractive to a wide variety of programmers. However, programming GPUs is still a cumbersome task for two primary reasons: tedious performance optimizations and lack of portability. First, optimizing an algorithm for a specific GPU is a time-consuming task that requires a thorough understanding of both the algorithm and the underlying hardware. Unoptimized CUDA programs typically achieve only a small fraction of the peak GPU performance. Second, GPU code lacks efficient portability, as code written for one GPU can be inefficient when executed on another. Moving code from one GPU to another while maintaining the desired performance is a non-trivial task, often requiring significant modifications to account for the hardware differences. In this work, we propose Sponge, a compilation framework for GPUs using synchronous data flow streaming languages. Sponge is capable of performing a wide variety of optimizations to generate efficient code for graphics engines. Sponge alleviates the problems associated with current GPU programming methods by providing portability across different generations of GPUs and CPUs, and a better abstraction of the hardware details, such as the memory hierarchy and threading model. Using streaming, we provide a write-once software paradigm and rely on the compiler to automatically create optimized CUDA code for a wide variety of GPU targets. Sponge’s compiler optimizations improve the performance of the baseline CUDA implementations by an average of 3.2x.
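
To make the portability problem concrete, below is a minimal sketch (hypothetical kernel and names, not taken from the paper) of the kind of hand-written CUDA that Sponge-style compilers aim to generate automatically; the hard-coded threads-per-block constant is exactly the sort of per-GPU tuning parameter that breaks performance portability.

```cuda
// Minimal vector-add sketch. The kernel itself is trivial; the point is
// the launch configuration below, which must be re-tuned per GPU target.
#include <cstdio>

__global__ void vecAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    float *a, *b, *c;
    cudaMallocManaged(&a, n * sizeof(float));
    cudaMallocManaged(&b, n * sizeof(float));
    cudaMallocManaged(&c, n * sizeof(float));
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    // Hand-picked tuning knob: 128 vs. 256 threads per block can change
    // occupancy and performance noticeably across GPU generations. A
    // stream compiler derives this per target instead of hard-coding it.
    const int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    vecAdd<<<blocks, threadsPerBlock>>>(a, b, c, n);
    cudaDeviceSynchronize();

    printf("c[0] = %f\n", c[0]);  // expect 3.0
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```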

Adaptive Input-aware Compilation for Graphics Engines, 2012.
"... Whileg raphics processing units (GPUs) provide low-cost and efficient platforms for accelerating high performance computations, thetediousprocessofperformancetuningrequired tooptimizeapplicationsisanobstacletowideradoptionofGPUs. In addition totheprogrammabilitychallengesposed by GPU’s complex memor ..."
Abstract
-
Cited by 12 (5 self)
- Add to MetaCart
While graphics processing units (GPUs) provide low-cost and efficient platforms for accelerating high performance computations, the tedious process of performance tuning required to optimize applications is an obstacle to wider adoption of GPUs. In addition to the programmability challenges posed by the GPU’s complex memory hierarchy and parallelism model, a well-known application design problem is target portability across different GPUs. However, even for a single GPU target, changing a program’s input characteristics can make an already-optimized implementation of a program perform poorly. In this work, we propose Adaptic, an adaptive input-aware compilation system to tackle this important, yet overlooked, input portability problem. Using this system, programmers develop their applications in a high-level streaming language and let Adaptic undertake the difficult task of input-portable optimizations and code generation. Several input-aware optimizations are introduced to make efficient use of the memory hierarchy and customize thread composition. At runtime, a properly optimized version of the application is executed based on the actual program input. We perform a head-to-head comparison between the Adaptic-generated and hand-optimized CUDA programs. The results show that Adaptic is capable of generating codes that can perform on par with their hand-optimized counterparts over certain input ranges and outperform them when the input falls out of the hand-optimized programs’ “comfort zone”. Furthermore, we show that input-aware results are sustainable across different GPU targets, making it possible to write and optimize applications once and run them anywhere.
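
As a rough illustration of the input-aware idea (all names hypothetical; this is not Adaptic’s generated code), the sketch below compiles two differently optimized versions of the same operation and lets a tiny runtime check pick between them based on the actual input size:

```cuda
// Two pre-optimized variants of the same scaling operation.

// Variant tuned for large inputs: a fixed grid with a grid-stride loop,
// which keeps all SMs busy without launching millions of tiny blocks.
__global__ void scaleWide(float* d, int n, float s) {
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        d[i] *= s;
}

// Variant tuned for small inputs: one element per thread, minimal overhead.
__global__ void scaleNarrow(float* d, int n, float s) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= s;
}

// Input-aware dispatch: the threshold is a stand-in for the tuning
// decisions an input-aware compiler would make per kernel and per GPU.
void scaleDispatch(float* d, int n, float s) {
    const int kLargeInput = 1 << 22;  // hypothetical crossover point
    if (n >= kLargeInput)
        scaleWide<<<1024, 256>>>(d, n, s);
    else
        scaleNarrow<<<(n + 255) / 256, 256>>>(d, n, s);
}
```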

Improving Performance of OpenCL on CPUs. In CC, 2012.
"... Abstract. Data-parallel languages like OpenCL and CUDA are an important means to exploit the computational power of today's computing devices. In this paper, we deal with two aspects of implementing such languages on CPUs: First, we present a static analysis and an accompanying optimization to ..."
Abstract
-
Cited by 8 (1 self)
- Add to MetaCart
(Show Context)
Data-parallel languages like OpenCL and CUDA are an important means to exploit the computational power of today’s computing devices. In this paper, we deal with two aspects of implementing such languages on CPUs: First, we present a static analysis and an accompanying optimization to exclude code regions from control-flow to data-flow conversion, which is the commonly used technique to leverage vector instruction sets. Second, we present a novel technique to implement barrier synchronization. We evaluate our techniques in a custom OpenCL CPU driver, which is compared to itself in different configurations and to proprietary implementations by AMD and Intel. We achieve an average speedup factor of 1.21 compared to naïve vectorization, and additional factors of 1.15-2.09 for suited kernels due to the optimizations enabled by our analysis. Our best configuration achieves an average speedup factor of over 2.5 against the Intel driver.
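
For readers unfamiliar with control-flow to data-flow conversion, the host-side sketch below (illustrative functions, not the paper’s driver code) shows the transformation that the paper’s analysis decides to apply or skip: a branch is replaced by an unconditional select so that every lane of a vector can execute the same instruction stream.

```cuda
// Host-only C++ sketch; no device code needed.

// Original loop: data-dependent control flow, awkward to vectorize as-is.
void clampBranchy(float* x, int n, float lo) {
    for (int i = 0; i < n; ++i) {
        if (x[i] < lo) x[i] = lo;
    }
}

// After control-flow to data-flow conversion: the branch is merged into
// a predicate plus a select, so the loop body is branch-free and maps
// directly onto masked/blended vector instructions.
void clampDataflow(float* x, int n, float lo) {
    for (int i = 0; i < n; ++i) {
        bool below = x[i] < lo;      // per-element predicate (the mask)
        x[i] = below ? lo : x[i];    // select instead of a branch
    }
}
```

The conversion is not free, since both sides of every branch are evaluated; that cost is why the paper’s analysis excludes regions that provably do not need it.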

Joint scheduling and layout optimization to enable multi-level vectorization. In Second Int. Workshop on Polyhedral Compilation Techniques (IMPACT 2012), 2012.
"... We describe a novel loop nest scheduling strategy imple-mented in the R-Stream compiler1: the first scheduling for-mulation to jointly optimize a trade-off between parallelism, locality, contiguity of array accesses and data layout permu-tations in a single complete formulation. Our search space con ..."
Abstract
-
Cited by 5 (0 self)
- Add to MetaCart
(Show Context)
We describe a novel loop nest scheduling strategy implemented in the R-Stream compiler: the first scheduling formulation to jointly optimize a trade-off between parallelism, locality, contiguity of array accesses, and data layout permutations in a single complete formulation. Our search space contains the maximal amount of vectorization in the program and automatically finds opportunities for automatic multi-level vectorization and simdization. Using our model of memory layout, we demonstrate that the amount of contiguous accesses, vectorization, and simdization can be increased modulo data layout permutations automatically exposed by our technique. This additional degree of freedom opens new opportunities for the scheduler that were previously out of reach. But perhaps the most significant aspect of this work is to encompass an ever increasing number of traditional optimization phases into a single pass. Our approach offers a good solution to the fundamental problem of phase ordering of high-level loop transformations.
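
A small, self-contained illustration (not the paper’s polyhedral formulation) of why layout permutations interact with contiguity: permuting an array-of-structs into a struct-of-arrays turns strided field accesses into unit-stride ones, which is what makes vectorized loads possible.

```cuda
// Host-only C++ sketch of a data layout permutation.

struct ParticleAoS { float x, y, z; };  // fields interleaved in memory

struct ParticlesSoA {                   // each field stored contiguously
    float* x;
    float* y;
    float* z;
};

// Array-of-structs: consecutive x values are 12 bytes apart, so the loop
// performs strided loads that defeat SIMD units.
void shiftAoS(ParticleAoS* p, int n, float dx) {
    for (int i = 0; i < n; ++i) p[i].x += dx;
}

// After the layout permutation: unit-stride, trivially vectorizable.
void shiftSoA(ParticlesSoA p, int n, float dx) {
    for (int i = 0; i < n; ++i) p.x[i] += dx;
}
```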

Compiler Techniques for ... Stream Programs on Multicore Architectures, 2010.
"... Given the ubiquity of multicore processors, there is an acute need to enable the development of scalable parallel applications without unduly burdening programmers. Currently, programmers are asked not only to explicitly expose parallelism but also concern themselves with issues of granularity, load ..."
Abstract
- Add to MetaCart
Given the ubiquity of multicore processors, there is an acute need to enable the development of scalable parallel applications without unduly burdening programmers. Currently, programmers are asked not only to explicitly expose parallelism but also to concern themselves with issues of granularity, load balancing, synchronization, and communication. This thesis demonstrates that when algorithmic parallelism is expressed in the form of a stream program, a compiler can effectively and automatically manage the parallelism. Our compiler assumes responsibility for low-level architectural details, transforming implicit algorithmic parallelism into a mapping that achieves scalable parallel performance for a given multicore target. Stream programming is characterized by regular processing of sequences of data, and it is a natural expression of algorithms in the areas of audio, video, digital signal processing, network- ...
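
As a minimal sketch of what such a stream program looks like (loosely modeled on StreamIt-style filters; all names illustrative), each filter declares fixed pop/push rates per firing, and this regularity is what lets a compiler choose granularity, load balance, and communication on the programmer’s behalf:

```cuda
// Host-only C++ sketch of a single-filter stream pipeline.
#include <vector>

struct LowPassFilter {
    static const int POP = 1, PUSH = 1;  // steady-state rates per firing
    float prev = 0.0f;                   // filter-local state
    float work(float in) {               // one firing: pop 1, push 1
        float out = 0.5f * (in + prev);
        prev = in;
        return out;
    }
};

// A compiler is free to fuse, fission, or software-pipeline firings; the
// sequential schedule below is just the reference semantics.
std::vector<float> runPipeline(const std::vector<float>& input) {
    LowPassFilter f;
    std::vector<float> out;
    out.reserve(input.size());
    for (float v : input) out.push_back(f.work(v));
    return out;
}
```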

Scheduling and Optimizing Stream Programs on Multicore Machines by Exploiting High-Level Abstractions, 2013.
"... Copyright © 2013, by the author(s). ..."

Vectorization and Mapping of Software Defined Radio Applications on Heterogeneous Multi-Processor Platforms.
"... A variety of multiprocessor architectures have proliferated even for off-the-shelf computing platforms. To improve performance and productivity for common heterogeneous systems, we have developed a workflow to generate efficient solutions. By starting with a formal description of an application and ..."
Abstract
- Add to MetaCart
(Show Context)
A variety of multiprocessor architectures have proliferated, even for off-the-shelf computing platforms. To improve performance and productivity for common heterogeneous systems, we have developed a workflow to generate efficient solutions. By starting with a formal description of an application and the mapping problem, we are able to generate a range of designs that efficiently trade off latency and throughput. In this approach, efficient utilization of SIMD cores is achieved by applying extensive block processing in conjunction with efficient mapping and scheduling. We demonstrate our approach through an integration into the GNU Radio environment for software defined radio system design.
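
A short sketch of the block-processing idea (hypothetical functions, in the spirit of a GNU Radio work function): each invocation consumes a whole block of samples rather than one, which amortizes scheduling overhead and hands the mapper a contiguous, vectorizable unit of work.

```cuda
// Host-only C++ sketch of block processing for an SDR actor.

// Per-sample actor invocation: one scheduler round-trip per item.
float gainOne(float in, float gain) { return gain * in; }

// Block processing: one invocation handles nsamples items; the inner
// loop is the unit that can be assigned to a SIMD core or vectorized.
void gainBlock(const float* in, float* out, int nsamples, float gain) {
    for (int i = 0; i < nsamples; ++i)
        out[i] = gain * in[i];
}
```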

Compiling Stream Applications for Heterogeneous Architectures, 2011.
"... First, I would like to express my sincerest gratitude to my adviser Prof. Scott Mahlke. I consider myself truly lucky to have worked with him these past years. He has shown incredible patience, served as an excellent mentor, and provided me every opportunity to succeed in this field. I also owe than ..."
Abstract
- Add to MetaCart
(Show Context)
First, I would like to express my sincerest gratitude to my adviser, Prof. Scott Mahlke. I consider myself truly lucky to have worked with him these past years. He has shown incredible patience, served as an excellent mentor, and provided me every opportunity to succeed in this field. I also owe thanks to the remaining members of my dissertation committee, Prof. Austin, Prof. Mudge, Prof. Sylvester, and Dr. Rodric Rabbah. They all donated their time to help shape this research into what it has become today. I would particularly like to thank Rodric for his insightful comments and invaluable advice during my internships at IBM T.J. Watson, which helped me find an interesting research path. I was lucky to be part of a research group whose members not only assisted me intellectually in my research but were also a comfort during those long nights before each deadline. Nathan Clark helped me in the first two years of my PhD; his patience and technical help were the reason I survived those years. Mark Woh spent countless hours discussing new ideas with me and helping me write my papers. Shuguang Feng, Shantanu Gupta, Ganesh Dasika, and Mojtaba Mehrara also helped in proofreading the papers and refining my ideas. Mehrzad Samadi did a great deal of work on the part of this thesis presented in Chapter V. More importantly than the technical assistance, I would like to thank all the members of the CCCP research group with whom I have ever shared an office over the years for their social ...