Results 1 - 10 of 36
PTask: operating system abstractions to manage GPUs as compute devices.
- In Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles
, 2011
"... ABSTRACT We propose a new set of OS abstractions to support GPUs and other accelerator devices as first class computing resources. These new abstractions, collectively called the PTask API, support a dataflow programming model. Because a PTask graph consists of OS-managed objects, the kernel has su ..."
Abstract
-
Cited by 52 (5 self)
- Add to MetaCart
(Show Context)
We propose a new set of OS abstractions to support GPUs and other accelerator devices as first-class computing resources. These new abstractions, collectively called the PTask API, support a dataflow programming model. Because a PTask graph consists of OS-managed objects, the kernel has sufficient visibility and control to provide system-wide guarantees like fairness and performance isolation, and can streamline data movement in ways that are impossible under current GPU programming models. Our experience developing the PTask API, along with a gestural interface on Windows 7 and a FUSE-based encrypted file system on Linux, shows that the PTask API can provide important system-wide guarantees where there were previously none, and can enable significant performance improvements, for example gaining a 5× improvement in maximum throughput for the gestural interface.
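The abstract does not show the PTask API itself. As a rough illustration of the dataflow-graph style it describes, the sketch below builds a one-kernel graph with CUDA's later, analogous graph API, where kernels become objects a runtime can schedule; this is an assumption-laden stand-in, not PTask's OS-level interface.

```cuda
#include <cuda_runtime.h>

__global__ void stage(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * in[i];  // placeholder computation
}

// Build a graph whose nodes the runtime, not the application, schedules.
// This mirrors the dataflow idea, though PTask's own API differs.
void run_graph(float* d_in, float* d_out, int n, cudaStream_t s) {
    cudaGraph_t graph;
    cudaGraphCreate(&graph, 0);

    void* args[] = { &d_in, &d_out, &n };
    cudaKernelNodeParams p = {};
    p.func = (void*)stage;
    p.gridDim = dim3((n + 255) / 256);
    p.blockDim = dim3(256);
    p.kernelParams = args;

    cudaGraphNode_t node;
    cudaGraphAddKernelNode(&node, graph, nullptr, 0, &p);

    cudaGraphExec_t exec;
    cudaGraphInstantiate(&exec, graph, nullptr, nullptr, 0);  // CUDA 11-style signature
    cudaGraphLaunch(exec, s);

    cudaGraphExecDestroy(exec);
    cudaGraphDestroy(graph);
}
```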
Adaptive Input-aware Compilation for Graphics Engines
, 2012
"... Whileg raphics processing units (GPUs) provide low-cost and efficient platforms for accelerating high performance computations, thetediousprocessofperformancetuningrequired tooptimizeapplicationsisanobstacletowideradoptionofGPUs. In addition totheprogrammabilitychallengesposed by GPU’s complex memor ..."
Abstract
-
Cited by 12 (5 self)
- Add to MetaCart
While graphics processing units (GPUs) provide low-cost and efficient platforms for accelerating high performance computations, the tedious process of performance tuning required to optimize applications is an obstacle to wider adoption of GPUs. In addition to the programmability challenges posed by the GPU's complex memory hierarchy and parallelism model, a well-known application design problem is target portability across different GPUs. However, even for a single GPU target, changing a program's input characteristics can make an already-optimized implementation of a program perform poorly. In this work, we propose Adaptic, an adaptive input-aware compilation system to tackle this important, yet overlooked, input portability problem. Using this system, programmers develop their applications in a high-level streaming language and let Adaptic undertake the difficult task of input-portable optimizations and code generation. Several input-aware optimizations are introduced to make efficient use of the memory hierarchy and customize thread composition. At runtime, a properly optimized version of the application is executed based on the actual program input. We perform a head-to-head comparison between the Adaptic-generated and hand-optimized CUDA programs. The results show that Adaptic is capable of generating code that can perform on par with its hand-optimized counterparts over certain input ranges and outperform them when the input falls out of the hand-optimized programs' "comfort zone". Furthermore, we show that input-aware results are sustainable across different GPU targets, making it possible to write and optimize applications once and run them anywhere.
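A minimal sketch of the runtime half of this idea, input-aware version selection: two variants of the same kernel, each assumed to favor a different input range, with the host dispatching on the actual input size. The kernels and the threshold are illustrative assumptions, not Adaptic's generated code.

```cuda
#include <cuda_runtime.h>

// Variant tuned for large inputs: grid-stride loop across many blocks.
__global__ void saxpy_large(int n, float a, const float* x, float* y) {
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        y[i] = a * x[i] + y[i];
}

// Variant tuned for small inputs: a single block, minimal launch overhead.
__global__ void saxpy_small(int n, float a, const float* x, float* y) {
    for (int i = threadIdx.x; i < n; i += blockDim.x)
        y[i] = a * x[i] + y[i];
}

// Input-aware dispatch: pick a variant from the actual input size.
// The 4096 threshold is an illustrative assumption, not a paper value.
void saxpy(int n, float a, const float* x, float* y) {
    if (n < 4096)
        saxpy_small<<<1, 256>>>(n, a, x, y);
    else
        saxpy_large<<<(n + 255) / 256, 256>>>(n, a, x, y);
}
```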
NOVA : A functional language for data parallelism
, 2013
"... Functional languages provide a solid foundation on which complex optimization passes can be designed to exploit available parallelism in the underlying system. Their mathematical foundations enable high-level optimizations that would be impossible in traditional im-perative languages. This makes the ..."
Abstract
-
Cited by 11 (0 self)
- Add to MetaCart
(Show Context)
Functional languages provide a solid foundation on which complex optimization passes can be designed to exploit available parallelism in the underlying system. Their mathematical foundations enable high-level optimizations that would be impossible in traditional imperative languages. This makes them uniquely suited for generation of efficient target code for parallel systems, such as multiple Central Processing Units (CPUs) or highly data-parallel Graphics Processing Units (GPUs). Such systems are becoming the mainstream for scientific and 'desktop' computing. Writing performance-portable code for such systems using low-level languages requires significant effort from a human expert. This paper presents NOVA, a functional language and compiler for multi-core CPUs and GPUs. The NOVA language is a polymorphic, statically-typed functional language with a suite of higher-order functions which are used to express parallelism. These include map, reduce and scan. The NOVA compiler is a lightweight, yet powerful, optimizing compiler. It generates code for a variety of target platforms that achieves performance comparable to competing languages and tools, including hand-optimized code. The NOVA compiler is stand-alone and can easily be used as a target for higher-level or domain-specific languages or embedded in other applications. We evaluate NOVA against two competing approaches: the Thrust library and hand-written CUDA C. NOVA achieves comparable performance to these approaches across a range of benchmarks. NOVA-generated code also scales linearly with the number of processor cores across all compute-bound benchmarks.
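For flavor, here is the same map/reduce/scan vocabulary the NOVA language exposes, written against Thrust, one of the paper's two comparison points; NOVA's own surface syntax is not shown in the abstract.

```cuda
#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/scan.h>
#include <thrust/reduce.h>
#include <thrust/functional.h>

// A dot-product-style pipeline built from the three higher-order
// primitives NOVA provides: map, scan, and reduce.
// 'prefix' is assumed to be pre-sized to a.size() by the caller.
float pipeline(const thrust::device_vector<float>& a,
               const thrust::device_vector<float>& b,
               thrust::device_vector<float>& prefix) {
    thrust::device_vector<float> prod(a.size());
    // map: elementwise multiply
    thrust::transform(a.begin(), a.end(), b.begin(), prod.begin(),
                      thrust::multiplies<float>());
    // scan: running sums of the products
    thrust::inclusive_scan(prod.begin(), prod.end(), prefix.begin());
    // reduce: the total
    return thrust::reduce(prod.begin(), prod.end(), 0.0f);
}
```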
Dandelion: a compiler and runtime for heterogeneous systems
- In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, ACM
, 2013
"... Computer systems increasingly rely on heterogeneity to achieve greater performance, scalability and en-ergy efficiency. Because heterogeneous systems typi-cally comprise multiple execution contexts with differ-ent programming abstractions and runtimes, program-ming them remains extremely challenging ..."
Abstract
-
Cited by 8 (0 self)
- Add to MetaCart
(Show Context)
Computer systems increasingly rely on heterogeneity to achieve greater performance, scalability and energy efficiency. Because heterogeneous systems typically comprise multiple execution contexts with different programming abstractions and runtimes, programming them remains extremely challenging. Dandelion is a system designed to address this programmability challenge for data-parallel applications. Dandelion provides a unified programming model for heterogeneous systems that span diverse execution contexts including CPUs, GPUs, FPGAs, and the cloud. It adopts the .NET LINQ (Language INtegrated Query) approach, integrating data-parallel operators into general purpose programming languages such as C# and F#. It therefore provides an expressive data model and native language integration for user-defined functions, enabling programmers to write applications using standard high-level languages and development tools. Dandelion automatically and transparently distributes data-parallel portions of a program to available computing resources, including compute clusters for distributed execution and CPU and GPU cores of individual nodes for parallel execution. To enable automatic execution of .NET code on GPUs, Dandelion cross-compiles .NET code to CUDA kernels and uses the PTask runtime [85] to manage GPU execution. This paper discusses the design and implementation of Dandelion, focusing on the distributed CPU and GPU implementation. We evaluate the system using a diverse set of workloads.
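As an illustration of the cross-compilation step (the generated code itself is not shown in the abstract), a LINQ Select over an array might lower to an elementwise CUDA kernel along these lines; the mapping shown is a plausible sketch, not Dandelion's actual output.

```cuda
// Illustrative only: a LINQ query such as
//   var squares = xs.Select(x => x * x);
// could be cross-compiled to an elementwise kernel where the lambda
// body becomes the per-thread computation.
__global__ void select_square(const float* xs, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = xs[i] * xs[i];  // body of the user lambda
}
```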
Complexity Analysis and Algorithm Design for Reorganizing Data to Minimize Non-Coalesced Memory Accesses on GPU
"... The performance of Graphic Processing Units (GPU) is sensitive to irregular memory references. Some recent work shows the promise of data reorganization for eliminating non-coalesced memory accesses that are caused by irregular references. However, all previous studies have employed simple, heuristi ..."
Abstract
-
Cited by 6 (1 self)
- Add to MetaCart
(Show Context)
The performance of Graphics Processing Units (GPUs) is sensitive to irregular memory references. Some recent work shows the promise of data reorganization for eliminating non-coalesced memory accesses that are caused by irregular references. However, all previous studies have employed simple, heuristic methods to determine the new data layouts to create. As a result, they either do not provide any performance guarantee or are effective in only some limited scenarios. This paper contributes a fundamental study of the problem. It systematically analyzes the inherent complexity of the problem in various settings and, for the first time, proves that the problem is NP-complete. It then points out the limitations of existing techniques and reveals that, in practice, the essence of designing an appropriate data reorganization algorithm can be reduced to a tradeoff among space, time, and complexity. Based on that insight, it develops two new data reorganization algorithms to overcome the limitations of previous methods. Experiments show that an assembly composed of the new algorithms and a previous algorithm can circumvent the inherent complexity in finding optimal data layouts, making it feasible to minimize non-coalesced memory accesses for a variety of irregular applications and settings that are beyond the reach of existing techniques.
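A small CUDA illustration of the access pattern at issue: the first kernel's indirect loads scatter across memory, while reading from a reorganized, thread-ordered copy coalesces. The kernels below are generic examples of the phenomenon, not the paper's algorithms; choosing such layouts well is exactly the problem the paper proves hard.

```cuda
// Irregular indirect access: threads in a warp touch scattered
// addresses of 'data', so their loads cannot coalesce.
__global__ void gather_irregular(const float* data, const int* idx,
                                 float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = data[idx[i]];
}

// After a reorganization pass copies data[idx[i]] into a contiguous
// buffer ordered by thread id, the same computation reads consecutive
// addresses and every warp load coalesces. Deciding when and how to
// build 'reordered' is the layout-selection problem analyzed above.
__global__ void gather_reordered(const float* reordered, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = reordered[i];
}
```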
Correctly treating synchronizations in compiling fine-grained SPMD-threaded programs for CPU
- In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques
, 2011
"... Abstract—Automatic compilation for multiple types of devices is important, especially given the current trends towards heterogeneous computing. This paper concentrates on some issues in compiling fine-grained SPMD-threaded code (e.g., GPU CUDA code) for multicore CPUs. It points out some correctness ..."
Abstract
-
Cited by 6 (3 self)
- Add to MetaCart
(Show Context)
Automatic compilation for multiple types of devices is important, especially given the current trend towards heterogeneous computing. This paper concentrates on some issues in compiling fine-grained SPMD-threaded code (e.g., GPU CUDA code) for multicore CPUs. It points out some correctness pitfalls in existing techniques, particularly in their treatment of implicit synchronizations. It then describes a systematic dependence analysis specially designed for handling implicit synchronizations in SPMD-threaded programs. By unveiling the relations between inter-thread data dependences and correct treatment of synchronizations, it presents a dependence-based solution to the problem. Experiments demonstrate that the proposed techniques can resolve the correctness issues in existing compilation techniques and help compilers produce correct and efficient translation results.
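A standard illustration of the pitfall (not the paper's own analysis): when an SPMD kernel with a barrier is serialized into a CPU thread loop, the loop must be split at the barrier so every logical thread finishes the pre-barrier phase before any begins the post-barrier phase.

```cuda
// Device view (for reference):
//   a[tid] = f(tid);
//   __syncthreads();   // all writes to a[] happen before any read below
//   b[tid] = a[(tid + 1) % nthreads];

static float f(int tid) { return (float)tid; }  // placeholder computation

// Correct CPU translation: loop fission at the barrier. Fusing the two
// loops would let one logical thread read a[tid + 1] before its
// neighbor has written it.
void block_on_cpu(int nthreads, float* a, float* b) {
    for (int tid = 0; tid < nthreads; ++tid)   // phase before the barrier
        a[tid] = f(tid);
    for (int tid = 0; tid < nthreads; ++tid)   // phase after the barrier
        b[tid] = a[(tid + 1) % nthreads];      // safely reads neighbor's write
}
```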
A large-scale cross-architecture evaluation of thread-coarsening
- In Proceedings of SC
, 2013
"... OpenCL has become the de-facto data parallel programming model for parallel devices in today’s high-performance su-percomputers. OpenCL was designed with the goal of guar-anteeing program portability across hardware from different vendors. However, achieving good performance is hard, re-quiring manu ..."
Abstract
-
Cited by 6 (3 self)
- Add to MetaCart
(Show Context)
OpenCL has become the de facto data parallel programming model for parallel devices in today's high-performance supercomputers. OpenCL was designed with the goal of guaranteeing program portability across hardware from different vendors. However, achieving good performance is hard, requiring manual tuning of the program and expert knowledge of each target device. In this paper we consider a data parallel compiler transformation — thread-coarsening — and evaluate its effects across a range of devices by developing a source-to-source OpenCL compiler based on LLVM. We thoroughly evaluate this transformation on 17 benchmarks and five platforms with different coarsening parameters, giving over 43,000 different experiments. We achieve speedups over 9x on individual applications and average speedups ranging from 1.15x on the Nvidia Kepler GPU to 1.50x on the AMD Cypress GPU. Finally, we use statistical regression to analyse and explain program performance in terms of hardware-based performance counters.
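A minimal before/after sketch of the transformation studied, shown in CUDA for consistency with the other sketches here (the paper's compiler operates on OpenCL):

```cuda
// Baseline: one thread per output element.
__global__ void scale(const float* in, float* out, int n, float k) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = k * in[i];
}

// Coarsened by a factor of 2: each thread processes two elements, so
// half as many threads are launched. This amortizes per-thread overhead
// at the cost of parallelism; as the paper shows, profitability varies
// widely across devices.
__global__ void scale_coarse2(const float* in, float* out, int n, float k) {
    int i = 2 * (blockIdx.x * blockDim.x + threadIdx.x);
    if (i < n)     out[i]     = k * in[i];
    if (i + 1 < n) out[i + 1] = k * in[i + 1];
}
```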
Portable Mapping of Data Parallel Programs to OpenCL for Heterogeneous Systems
"... General purpose GPU based systems are highly attractive as they give potentially massive performance at little cost. Realizing such potential is challenging due to the complexity of programming. This paper presents a compiler based approach to automatically generate optimized OpenCL code from data-p ..."
Abstract
-
Cited by 5 (1 self)
- Add to MetaCart
(Show Context)
General purpose GPU-based systems are highly attractive as they give potentially massive performance at little cost. Realizing such potential is challenging due to the complexity of programming. This paper presents a compiler-based approach to automatically generate optimized OpenCL code from data-parallel OpenMP programs for GPUs. Such an approach brings together the benefits of a clear high-level language (OpenMP) and an emerging standard (OpenCL) for heterogeneous multi-cores. A key feature of our scheme is that it leverages existing transformations, especially data transformations, to improve performance on GPU architectures, and uses predictive modeling to automatically determine whether it is worthwhile running the OpenCL code on the GPU or the OpenMP code on the multi-core host. We applied our approach to the entire NAS parallel benchmark suite and evaluated it on two distinct GPU-based systems: Core i7/NVIDIA GeForce GTX 580 and Core i7/AMD Radeon 7970. We achieved average speedups of 4.51x and 4.20x (up to 143x and 67x) respectively over a sequential baseline. This is, on average, a factor of 1.63 and 1.56 times faster than a hand-coded, GPU-specific OpenCL implementation developed by independent expert programmers.
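The core of the mapping can be sketched as follows: the parallel loop body becomes the kernel body and the loop index becomes the work-item id. The kernel is written in CUDA here for consistency with the other sketches; the paper's compiler emits OpenCL.

```cuda
// Data-parallel OpenMP loop (host source):
//   #pragma omp parallel for
//   for (int i = 0; i < n; ++i) c[i] = a[i] + b[i];
//
// Corresponding generated kernel: the loop body, with the loop index
// replaced by the global thread id and the trip count as a guard.
__global__ void vadd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // loop index -> thread id
    if (i < n) c[i] = a[i] + b[i];
}
```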
Softshell: Dynamic Scheduling on GPUs
"... In this paper we present Softshell, a novel execution model for devices composed of multiple processing cores operating in a single instruction, multiple data fashion, such as graphics processing units (GPUs). The Softshell model is intuitive and more flexible than the kernel-based adaption of the s ..."
Abstract
-
Cited by 4 (2 self)
- Add to MetaCart
In this paper we present Softshell, a novel execution model for devices composed of multiple processing cores operating in a single instruction, multiple data fashion, such as graphics processing units (GPUs). The Softshell model is intuitive and more flexible than the kernel-based adaptation of the stream processing model, which is currently the dominant model for general purpose GPU computation. Using the Softshell model, algorithms with a relatively low local degree of parallelism can execute efficiently on massively parallel architectures. Softshell has the following distinct advantages: (1) work can be dynamically issued directly on the device, eliminating the need for synchronization with an external source, i.e., the CPU; (2) its three-tier dynamic scheduler supports arbitrary scheduling strategies, including dynamic priorities and real-time scheduling; and (3) the user can influence, pause, and cancel work already submitted for parallel execution. The Softshell processing model thus brings capabilities to GPU architectures that were previously only known from operating-system designs and reserved for CPU programming. As a proof of our claims, we present a publicly available implementation of the Softshell processing model realized on top of CUDA. The benchmarks of this implementation demonstrate that our processing model is easy to use and also performs substantially better than the state-of-the-art kernel-based processing model for problems that have been difficult to parallelize in the past.
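Softshell's actual API is not given in the abstract; the persistent-threads sketch below illustrates the underlying idea of device-resident scheduling: blocks pull tasks from a global queue rather than being fixed at launch time. All names here are illustrative.

```cuda
#include <cuda_runtime.h>

struct WorkItem { int first, count; };  // hypothetical task descriptor

// Persistent worker: each block repeatedly pops a task from a global
// queue until it is drained, instead of being bound to a fixed slice of
// one kernel launch. Dynamic priorities would order this queue.
__global__ void worker(const WorkItem* queue, int* head, int tail,
                       float* data) {
    __shared__ int item;
    for (;;) {
        if (threadIdx.x == 0)
            item = atomicAdd(head, 1);   // pop the next task index
        __syncthreads();
        if (item >= tail) return;        // no work left
        WorkItem w = queue[item];
        for (int i = w.first + threadIdx.x; i < w.first + w.count;
             i += blockDim.x)
            data[i] *= 2.0f;             // placeholder task body
        __syncthreads();                 // finish task before popping again
    }
}
```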
Exploring Hybrid Memory for GPU Energy Efficiency through Software-Hardware Co-Design
"... Abstract—Hybrid memory designs, such as DRAM plus Phase Change Memory (PCM), have shown some promise for alleviating power and density issues faced by traditional memory systems. But previous studies have concentrated on CPU systems with a modest level of parallelism. This work studies the problem i ..."
Abstract
-
Cited by 3 (1 self)
- Add to MetaCart
(Show Context)
Hybrid memory designs, such as DRAM plus Phase Change Memory (PCM), have shown some promise for alleviating the power and density issues faced by traditional memory systems. But previous studies have concentrated on CPU systems with a modest level of parallelism. This work studies the problem in a massively parallel setting. Specifically, it investigates the special implications for hybrid memory imposed by the massive parallelism in GPUs. It empirically shows that, contrary to the promising results demonstrated for CPUs, previous designs of PCM-based hybrid memory result in significant degradation of the energy efficiency of GPUs. It reveals that the fundamental reason is a multi-facet mismatch between those designs and the massive parallelism in GPUs. It presents a solution that centers around a close cooperation between compiler-directed data placement and hardware-assisted runtime adaptation. The co-design approach helps tap into the full potential of hybrid memory for GPUs without requiring dramatic hardware changes over previous designs, yielding 6% and 49% energy savings on average compared to pure DRAM and pure PCM respectively, while keeping performance loss below 2%.
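A toy sketch of the compiler-directed placement half of such a co-design, under the common assumption that writes are the expensive operation on PCM: write-intensive data goes to DRAM, read-mostly data to PCM. The threshold and profile inputs are illustrative, and the hardware-assisted runtime adaptation the paper pairs with placement is not modeled here.

```cuda
enum class Mem { DRAM, PCM };

// Decide a home for an array from profiled access counts. PCM writes
// are slow and energy-hungry, so write-heavy arrays stay in DRAM;
// read-mostly arrays can exploit PCM's density and low leakage.
// The 10% write-ratio cutoff is an illustrative assumption.
Mem place(long reads, long writes) {
    double write_ratio =
        (double)writes / (double)(reads + writes + 1);  // +1 avoids div by 0
    return write_ratio > 0.1 ? Mem::DRAM : Mem::PCM;
}
```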