Results 1 - 10 of 62
CudaDMA: Optimizing GPU Memory Bandwidth via Warp Specialization
"... As the computational power of GPUs continues to scale with Moore’s Law, an increasing number of applications are becoming limited by memory bandwidth. We propose an approach for programming GPUs with tightly-coupled specialized DMA warps for performing memory transfers between on-chip and off-chip m ..."
Abstract
-
Cited by 25 (2 self)
- Add to MetaCart
(Show Context)
As the computational power of GPUs continues to scale with Moore’s Law, an increasing number of applications are becoming limited by memory bandwidth. We propose an approach for programming GPUs with tightly-coupled specialized DMA warps for performing memory transfers between on-chip and off-chip memories. Separate DMA warps improve memory bandwidth utilization by better exploiting available memory-level parallelism and by leveraging efficient inter-warp producer-consumer synchronization mechanisms. DMA warps also improve programmer productivity by decoupling the need for thread array shapes to match data layout. To illustrate the benefits of this approach, we present an extensible API, CudaDMA, that encapsulates synchronization and common sequential and strided data transfer patterns. Using CudaDMA, we demonstrate speedup of up to 1.37x on representative synthetic microbenchmarks, and 1.15x-3.2x on several kernels from scientific applications written in CUDA running on NVIDIA Fermi GPUs.
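The warp-specialization idea can be pictured without the CudaDMA API itself. Below is a minimal CUDA sketch, not the paper's code: the SAXPY kernel, the tile size, the 96/32 compute/DMA thread split, and the block-wide __syncthreads() (standing in for CudaDMA's finer-grained inter-warp producer-consumer barriers) are all assumptions made for illustration.

```cuda
// Sketch of warp specialization: warp 3 of each 128-thread block stages x
// into shared memory (the "DMA" role) while warps 0-2 compute on the tile.
constexpr int TILE            = 256;  // elements staged per iteration (assumption)
constexpr int COMPUTE_THREADS = 96;   // threads 0-95: compute warps
constexpr int BLOCK_THREADS   = 128;  // threads 96-127: the DMA warp

__global__ void saxpy_warp_specialized(const float* __restrict__ x,
                                       const float* __restrict__ y,
                                       float* __restrict__ out,
                                       float a, int n)
{
    __shared__ float xs[TILE];
    const bool is_dma = (threadIdx.x >= COMPUTE_THREADS);

    // Each block walks over its own strided sequence of tiles.
    for (int base = blockIdx.x * TILE; base < n; base += gridDim.x * TILE) {
        if (is_dma) {
            // DMA warp: copy one tile of x from global to shared memory.
            for (int i = threadIdx.x - COMPUTE_THREADS; i < TILE; i += 32)
                if (base + i < n) xs[i] = x[base + i];
        }
        __syncthreads();   // staged tile is now visible to the compute warps
        if (!is_dma) {
            // Compute warps: consume the staged tile.
            for (int i = threadIdx.x; i < TILE; i += COMPUTE_THREADS)
                if (base + i < n) out[base + i] = a * xs[i] + y[base + i];
        }
        __syncthreads();   // the buffer may be overwritten on the next pass
    }
}

// Hypothetical launch: saxpy_warp_specialized<<<64, BLOCK_THREADS>>>(dx, dy, dout, 2.f, n);
```

In the approach described above the two warp groups synchronize with each other directly, so transfers for one tile can overlap computation on another; the block-wide barrier here sacrifices that overlap for brevity.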
Terra: A Multi-Stage Language for High-Performance Computing
"... High-performance computing applications, such as auto-tuners and domain-specific languages, rely on generative programming techniques to achieve high performance and portability. However, these systems are often implemented in multiple disparate languages and perform code generation in a separate pr ..."
Abstract
-
Cited by 15 (5 self)
- Add to MetaCart
(Show Context)
High-performance computing applications, such as auto-tuners and domain-specific languages, rely on generative programming techniques to achieve high performance and portability. However, these systems are often implemented in multiple disparate languages and perform code generation in a separate process from program execution, making certain optimizations difficult to engineer. We leverage a popular scripting language, Lua, to stage the execution of a novel low-level language, Terra. Users can implement optimizations in the high-level language, and use built-in constructs to generate and execute high-performance Terra code. To simplify metaprogramming, Lua and Terra share the same lexical environment, but, to ensure performance, Terra code can execute independently of Lua’s runtime. We evaluate our design by reimplementing existing multi-language systems entirely in Terra. Our Terra-based autotuner for BLAS routines performs within 20% of ATLAS, and our DSL for stencil computations runs 2.3x faster than hand-written C.
Nested data-parallelism on the GPU
In Proceedings of the 17th ACM SIGPLAN International Conference on Functional Programming (ICFP), 2012
"... Graphics processing units (GPUs) provide both memory bandwidth and arithmetic performance far greater than that available on CPUs but, because of their Single-Instruction-Multiple-Data (SIMD) ar-chitecture, they are hard to program. Most of the programs ported to GPUs thus far use traditional data-l ..."
Abstract
-
Cited by 13 (0 self)
- Add to MetaCart
(Show Context)
Graphics processing units (GPUs) provide both memory bandwidth and arithmetic performance far greater than that available on CPUs but, because of their Single-Instruction-Multiple-Data (SIMD) architecture, they are hard to program. Most of the programs ported to GPUs thus far use traditional data-level parallelism, performing only operations that operate uniformly over vectors. NESL is a first-order functional language that was designed to allow programmers to write irregular-parallel programs, such as parallel divide-and-conquer algorithms, for wide-vector parallel computers. This paper presents our port of the NESL implementation to work on GPUs and provides empirical evidence that nested data-parallelism (NDP) on GPUs significantly outperforms CPU-based implementations and matches or beats newer GPU languages that support only flat parallelism. While our performance does not match that of hand-tuned CUDA programs, we argue that the notational conciseness of NESL is worth the loss in performance. This work provides the first language implementation that directly supports NDP on a GPU.
Kernel Weaver: Automatically Fusing Database Primitives for Efficient GPU Computation
In MICRO, 2012
"... Data warehousing applications represent an emerging application arena that requires the processing of relational queries and computations over massive amounts of data. Modern general purpose GPUs are high bandwidth architectures that potentially offer substantial improvements in throughput for these ..."
Abstract
-
Cited by 12 (2 self)
- Add to MetaCart
(Show Context)
Data warehousing applications represent an emerging application arena that requires the processing of relational queries and computations over massive amounts of data. Modern general purpose GPUs are high bandwidth architectures that potentially offer substantial improvements in throughput for these applications. However, there are significant challenges that arise due to the overheads of data movement through the memory hierarchy and between the GPU and host CPU. This paper proposes data movement optimizations to address these challenges. Inspired in part by loop fusion optimizations in the scientific computing community, we propose kernel fusion as a basis for data movement optimizations. Kernel fusion fuses the code bodies of two GPU kernels to i) reduce the data footprint and cut down data movement throughout the GPU and CPU memory hierarchy, and ii) enlarge the compiler optimization scope. We classify producer-consumer dependences between compute kernels into three types: i) fine-grained thread-to-thread dependences, ii) medium-grained thread block dependences, and iii) coarse-grained kernel dependences. Based on this classification, we propose a compiler framework, Kernel Weaver, that can automatically fuse relational algebra operators, thereby eliminating redundant data movement. Experiments on NVIDIA Fermi platforms demonstrate that kernel fusion achieves a 2.89x speedup in GPU computation and a 2.35x speedup in PCIe transfer time on average across the microbenchmarks tested. We present key insights, lessons learned, measurements from our compiler implementation, and opportunities for further improvements.
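To make the data-movement argument concrete, the sketch below contrasts an unfused and a fused version of a toy selection-plus-sum query in plain CUDA. It is not Kernel Weaver output; the operators, names, and grid-stride pattern are illustrative assumptions.

```cuda
// Toy illustration of kernel fusion for a selection followed by a sum.

// Unfused: the selection result is materialized in global memory and
// re-read by a second kernel, costing an extra round trip per element.
__global__ void select_gt(const int* in, int* selected, int n, int threshold) {
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        selected[i] = (in[i] > threshold) ? in[i] : 0;
}

__global__ void sum_selected(const int* selected, unsigned long long* total, int n) {
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        atomicAdd(total, (unsigned long long)selected[i]);
}

// Fused: the intermediate stays in a register, the "selected" array (one full
// pass of global-memory traffic) disappears, and the compiler sees both
// operator bodies in a single optimization scope.
__global__ void select_and_sum(const int* in, unsigned long long* total,
                               int n, int threshold) {
    unsigned long long local = 0;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        if (in[i] > threshold) local += (unsigned long long)in[i];
    atomicAdd(total, local);   // one atomic per thread instead of per element
}
```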
Adaptive Input-aware Compilation for Graphics Engines
2012
"... Whileg raphics processing units (GPUs) provide low-cost and efficient platforms for accelerating high performance computations, thetediousprocessofperformancetuningrequired tooptimizeapplicationsisanobstacletowideradoptionofGPUs. In addition totheprogrammabilitychallengesposed by GPU’s complex memor ..."
Abstract
-
Cited by 12 (5 self)
- Add to MetaCart
While graphics processing units (GPUs) provide low-cost and efficient platforms for accelerating high performance computations, the tedious process of performance tuning required to optimize applications is an obstacle to wider adoption of GPUs. In addition to the programmability challenges posed by the GPU’s complex memory hierarchy and parallelism model, a well-known application design problem is target portability across different GPUs. However, even for a single GPU target, changing a program’s input characteristics can make an already-optimized implementation of a program perform poorly. In this work, we propose Adaptic, an adaptive input-aware compilation system to tackle this important, yet overlooked, input portability problem. Using this system, programmers develop their applications in a high-level streaming language and let Adaptic undertake the difficult task of input-portable optimizations and code generation. Several input-aware optimizations are introduced to make efficient use of the memory hierarchy and customize thread composition. At runtime, a properly optimized version of the application is executed based on the actual program input. We perform a head-to-head comparison between the Adaptic-generated and hand-optimized CUDA programs. The results show that Adaptic is capable of generating codes that can perform on par with their hand-optimized counterparts over certain input ranges and outperform them when the input falls out of the hand-optimized programs’ “comfort zone”. Furthermore, we show that input-aware results are sustainable across different GPU targets, making it possible to write and optimize applications once and run them anywhere.
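The runtime side of input-aware execution can be sketched as a host-side dispatch between pre-built kernel variants. This is not Adaptic's generated code; the reduction variants, the size threshold, and the launch shapes are assumptions chosen only to show the decision point.

```cuda
// Host-side sketch of input-aware dispatch: two variants of a sum reduction,
// chosen from the actual input size at run time. *d_out is assumed to be
// zero-initialized before the call.

__global__ void reduce_small(const float* in, float* out, int n) {
    // Single-block variant: enough when the input cannot fill the GPU anyway.
    float s = 0.f;
    for (int i = threadIdx.x; i < n; i += blockDim.x) s += in[i];
    atomicAdd(out, s);
}

__global__ void reduce_large(const float* in, float* out, int n) {
    // Many-block, grid-stride variant for inputs that can saturate the device.
    float s = 0.f;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        s += in[i];
    atomicAdd(out, s);
}

void reduce_input_aware(const float* d_in, float* d_out, int n) {
    const int kSmallLimit = 1 << 14;         // tuning knob (assumption)
    if (n <= kSmallLimit) {
        reduce_small<<<1, 256>>>(d_in, d_out, n);
    } else {
        int blocks = (n + 255) / 256;
        if (blocks > 1024) blocks = 1024;    // cap, then rely on the grid stride
        reduce_large<<<blocks, 256>>>(d_in, d_out, n);
    }
}
```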
NOVA: A functional language for data parallelism
2013
"... Functional languages provide a solid foundation on which complex optimization passes can be designed to exploit available parallelism in the underlying system. Their mathematical foundations enable high-level optimizations that would be impossible in traditional im-perative languages. This makes the ..."
Abstract
-
Cited by 11 (0 self)
- Add to MetaCart
(Show Context)
Functional languages provide a solid foundation on which complex optimization passes can be designed to exploit available parallelism in the underlying system. Their mathematical foundations enable high-level optimizations that would be impossible in traditional imperative languages. This makes them uniquely suited for generation of efficient target code for parallel systems, such as multiple Central Processing Units (CPUs) or highly data-parallel Graphics Processing Units (GPUs). Such systems are becoming the mainstream for scientific and ‘desktop’ computing. Writing performance portable code for such systems using low-level languages requires significant effort from a human expert. This paper presents NOVA, a functional language and compiler for multi-core CPUs and GPUs. The NOVA language is a polymorphic, statically-typed functional language with a suite of higher-order functions which are used to express parallelism. These include map, reduce and scan. The NOVA compiler is a light-weight, yet powerful, optimizing compiler. It generates code for a variety of target platforms that achieve performance comparable to competing languages and tools, including hand-optimized code. The NOVA compiler is stand-alone and can be easily used as a target for higher-level or domain specific languages or embedded in other applications. We evaluate NOVA against two competing approaches: the Thrust library and hand-written CUDA C. NOVA achieves comparable performance to these approaches across a range of benchmarks. NOVA-generated code also scales linearly with the number of processor cores across all compute-bound benchmarks.
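NOVA's own syntax is not reproduced here; as a point of reference for the map/reduce style it compiles, the snippet below writes the same pattern against Thrust, the library the paper evaluates against. The particular computation (a sum of squares) is an arbitrary example.

```cuda
// Map/reduce expressed with Thrust, the comparison baseline from the paper.
#include <thrust/device_vector.h>
#include <thrust/transform_reduce.h>
#include <thrust/functional.h>
#include <cstdio>

struct square {
    __host__ __device__ float operator()(float x) const { return x * x; }
};

int main() {
    thrust::device_vector<float> v(1 << 20, 0.5f);

    // map (square each element) then reduce (+) in one fused library call.
    float sum_sq = thrust::transform_reduce(v.begin(), v.end(),
                                            square(), 0.0f,
                                            thrust::plus<float>());
    std::printf("sum of squares = %f\n", sum_sq);
    return 0;
}
```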
Dandelion: a compiler and runtime for heterogeneous systems
In Proc. of the Twenty-Fourth ACM Symp. on Operating Systems Principles, ACM
"... Computer systems increasingly rely on heterogeneity to achieve greater performance, scalability and en-ergy efficiency. Because heterogeneous systems typi-cally comprise multiple execution contexts with differ-ent programming abstractions and runtimes, program-ming them remains extremely challenging ..."
Abstract
-
Cited by 8 (0 self)
- Add to MetaCart
(Show Context)
Computer systems increasingly rely on heterogeneity to achieve greater performance, scalability and energy efficiency. Because heterogeneous systems typically comprise multiple execution contexts with different programming abstractions and runtimes, programming them remains extremely challenging. Dandelion is a system designed to address this programmability challenge for data-parallel applications. Dandelion provides a unified programming model for heterogeneous systems that span diverse execution contexts including CPUs, GPUs, FPGAs, and the cloud. It adopts the .NET LINQ (Language INtegrated Query) approach, integrating data-parallel operators into general purpose programming languages such as C# and F#. It therefore provides an expressive data model and native language integration for user-defined functions, enabling programmers to write applications using standard high-level languages and development tools. Dandelion automatically and transparently distributes data-parallel portions of a program to available computing resources, including compute clusters for distributed execution and CPU and GPU cores of individual nodes for parallel execution. To enable automatic execution of .NET code on GPUs, Dandelion cross-compiles .NET code to CUDA kernels and uses the PTask runtime [85] to manage GPU execution. This paper discusses the design and implementation of Dandelion, focusing on the distributed CPU and GPU implementation. We evaluate the system using a diverse set of workloads.
Composition and reuse with compiled domain-specific languages
In Proceedings of ECOOP, 2013
"... Abstract. Programmers who need high performance currently rely on low-level, architecture-specific programming models (e.g. OpenMP for CMPs, CUDA for GPUs, MPI for clusters). Performance optimization with these frameworks usually requires expertise in the specific programming model and a deep unders ..."
Abstract
-
Cited by 7 (4 self)
- Add to MetaCart
(Show Context)
Programmers who need high performance currently rely on low-level, architecture-specific programming models (e.g. OpenMP for CMPs, CUDA for GPUs, MPI for clusters). Performance optimization with these frameworks usually requires expertise in the specific programming model and a deep understanding of the target architecture. Domain-specific languages (DSLs) are a promising alternative, allowing compilers to map problem-specific abstractions directly to low-level architecture-specific programming models. However, developing DSLs is difficult, and using multiple DSLs together in a single application is even harder because existing compiled solutions do not compose together. In this paper, we present four new performance-oriented DSLs developed with Delite, an extensible DSL compilation framework. We demonstrate new techniques to compose compiled DSLs embedded in a common backend together in a single program and show that generic optimizations can be applied across the different DSL sections. Our new DSLs are implemented with a small number of reusable components (less than 9 parallel operators total) and still achieve performance up to 125x better than library implementations and at worst within 30% of optimized stand-alone DSLs. The DSLs retain good performance when composed together, and applying cross-DSL optimizations results in up to an additional 1.82x improvement.
HiDP: A Hierarchical Data Parallel Language
"... Problem domains are commonly decomposed hierarchically to fully utilize parallel resources in modern microprocessors. Such decompositions can be provided as library routines, written by experienced experts, for general algorithmic patterns. But such APIs tend to be constrained to certain architectur ..."
Abstract
-
Cited by 7 (0 self)
- Add to MetaCart
(Show Context)
Problem domains are commonly decomposed hierarchically to fully utilize parallel resources in modern microprocessors. Such decompositions can be provided as library routines, written by experienced experts, for general algorithmic patterns. But such APIs tend to be constrained to certain architectures or data sizes. Integrating them with application code is often an unnecessarily daunting task, especially when these routines need to be closely coupled with user code to achieve better performance. This paper contributes HiDP, a hierarchical data parallel language. The purpose of HiDP is to improve the coding productivity of integrating hierarchical data parallelism without significant loss of performance. HiDP is a source-to-source compiler that converts a very concise data parallel language into CUDA C++ source code. Internally, it performs necessary analysis to compose user code with efficient and architecture-aware code snippets. This paper discusses various aspects of HiDP systematically: the language, the compiler and the run-time system with built-in tuning capabilities. They enable HiDP users to express algorithms in less code than low-level SDKs require for native platforms. HiDP also exposes abundant computing resources of modern parallel architectures. Improved coding productivity tends to come with a sacrifice in performance. Yet, experimental results show that the generated code delivers performance very close to handcrafted native GPU code.
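HiDP emits CUDA C++ from a concise data-parallel notation; the hand-written sketch below only illustrates the kind of hierarchical mapping such a compiler must produce: the outer level of parallelism (matrix rows) goes to thread blocks and the inner level (row elements) to threads, joined by a shared-memory reduction. The row-sum computation and the power-of-two block-size bound are illustrative assumptions.

```cuda
// Two-level decomposition: blocks stride over rows, threads stride over the
// elements of one row, and a tree reduction in shared memory joins the levels.
__global__ void row_sums(const float* __restrict__ m, float* __restrict__ out,
                         int rows, int cols)
{
    __shared__ float partial[256];           // blockDim.x <= 256, power of two assumed

    for (int row = blockIdx.x; row < rows; row += gridDim.x) {
        // Inner level: threads accumulate a partial sum of one row.
        float s = 0.f;
        for (int c = threadIdx.x; c < cols; c += blockDim.x)
            s += m[row * cols + c];

        // Join the inner level: tree reduction in shared memory.
        partial[threadIdx.x] = s;
        __syncthreads();
        for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
            if (threadIdx.x < stride)
                partial[threadIdx.x] += partial[threadIdx.x + stride];
            __syncthreads();
        }
        if (threadIdx.x == 0) out[row] = partial[0];
        __syncthreads();                     // shared buffer is reused for the next row
    }
}
```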
Applying graphics processor acceleration in a software defined radio prototyping environment
In Proceedings of the International Symposium on Rapid System Prototyping, 2011
"... Abstract—With higher bandwidth requirements and more complex protocols, software defined radio (SDR) has ever growing computational demands. SDR applications have different levels of parallelism that can be exploited on multicore platforms, but design and programming difficulties have inhibited the ..."
Abstract
-
Cited by 5 (2 self)
- Add to MetaCart
(Show Context)
With higher bandwidth requirements and more complex protocols, software defined radio (SDR) has ever-growing computational demands. SDR applications have different levels of parallelism that can be exploited on multicore platforms, but design and programming difficulties have inhibited the adoption of specialized multicore platforms like graphics processors (GPUs). In this work we propose a new design flow that augments a popular existing SDR development environment (GNU Radio) with a dataflow foundation and a stand-alone GPU-accelerated library. The approach gives an SDR developer the ability to prototype a GPU-accelerated application and explore its design space quickly and effectively. We demonstrate this design flow on a standard SDR benchmark and show that deciding how to utilize a GPU can be non-trivial for even relatively simple applications.
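The GPU-accelerated library integrated with GNU Radio is not reproduced here; the kernel below only illustrates the per-sample parallelism a typical SDR block exposes on a GPU, using a finite impulse response (FIR) filter in which each thread computes one output sample. The constant-memory tap storage and the 64-tap limit are assumptions for the sketch.

```cuda
// One output sample per thread: a dot product of the filter taps with a
// sliding window over the input stream.
#define MAX_TAPS 64

__constant__ float d_taps[MAX_TAPS];   // filled from the host with cudaMemcpyToSymbol

__global__ void fir_filter(const float* __restrict__ in,   // n_out + n_taps - 1 samples
                           float* __restrict__ out,
                           int n_out, int n_taps)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n_out) return;

    float acc = 0.f;
    for (int t = 0; t < n_taps; ++t)
        acc += d_taps[t] * in[i + t];
    out[i] = acc;
}
```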