A proposal to extend the OpenMP tasking model for heterogeneous multicores.
In 5th International Workshop on OpenMP (IWOMP), 2009
"... Abstract. OpenMP has evolved recently towards expressing unstructured parallelism, targeting the parallelization of a broader range of applications in the current multicore era. Homogeneous multicore architectures from major vendors have become mainstream, but with clear indications that a better p ..."
Abstract
-
Cited by 17 (6 self)
- Add to MetaCart
(Show Context)
Abstract. OpenMP has evolved recently towards expressing unstructured parallelism, targeting the parallelization of a broader range of applications in the current multicore era. Homogeneous multicore architectures from major vendors have become mainstream, but with clear indications that a better performance/power ratio can be achieved using more specialized hardware (accelerators), such as SSE-based units or GPUs, clearly deviating from the easy-to-understand shared-memory homogeneous architectures. This paper investigates whether OpenMP could still survive in this new scenario and proposes a possible way to extend the current specification to reasonably integrate heterogeneity while preserving simplicity and portability. The paper builds on a previous proposal that extended tasking with dependencies. The runtime is in charge of data movement, task scheduling based on these data dependencies, and the appropriate selection of the target accelerator depending on system configuration and resource availability.
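As a rough illustration of the directive style this proposal describes, the sketch below annotates two tasks with input/output clauses and a device hint so that a runtime could infer both the dependence between them and the host-accelerator copies they need. The clause spellings (device, input, inout) are illustrative, not the exact syntax defined in the paper.

    void saxpy(int n, float a, const float *x, float *y);

    void compute(int n, float a, float *x, float *y, float *z)
    {
        /* The second task reads y, which the first task writes, so the runtime
           orders them accordingly and inserts the host<->accelerator transfers
           implied by the annotations. */
        #pragma omp target device(gpu)
        #pragma omp task input(x[0:n]) inout(y[0:n])
        saxpy(n, a, x, y);

        #pragma omp target device(gpu)
        #pragma omp task input(y[0:n]) inout(z[0:n])
        saxpy(n, a, y, z);

        #pragma omp taskwait  /* wait for both tasks and for results to reach the host */
    }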
GRace: A low-overhead mechanism for detecting data races in GPU programs
In PPoPP ’11, 2011
"... In recent years, GPUs have emerged as an extremely cost-effective means for achieving high performance. Many application developers, including those with no prior parallel programming experience, are now trying to scale their applications using GPUs. While languages like CUDA and OpenCL have eased G ..."
Abstract
-
Cited by 17 (0 self)
- Add to MetaCart
(Show Context)
In recent years, GPUs have emerged as an extremely cost-effective means for achieving high performance. Many application developers, including those with no prior parallel programming experience, are now trying to scale their applications using GPUs. While languages like CUDA and OpenCL have eased GPU programming for non-graphical applications, they are still explicitly parallel languages. All parallel programmers, particularly the novices, need tools that can help ensure the correctness of their programs. Like any multithreaded environment, data races on GPUs can severely affect program reliability. Thus, tool support for detecting race conditions can significantly benefit GPU application developers. Existing approaches for detecting data races on CPUs or GPUs have one or more of the following limitations: 1) being ill-suited for handling non-lock synchronization primitives on GPUs;
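The following CUDA fragment is a minimal example of the kind of defect such a detector targets (it is an illustration of a GPU data race, not anything from the GRace tool itself): threads in a block update a shared-memory histogram with a plain read-modify-write, so the result depends on warp scheduling. Replacing the marked line with atomicAdd(&local[b], 1) removes the race.

    __global__ void histogram_racy(const int *data, int n, int *bins)
    {
        __shared__ int local[256];
        int tid = threadIdx.x;
        if (tid < 256) local[tid] = 0;
        __syncthreads();

        for (int i = blockIdx.x * blockDim.x + tid; i < n; i += blockDim.x * gridDim.x) {
            int b = data[i] & 255;
            local[b] = local[b] + 1;  /* RACE: non-atomic read-modify-write in shared memory */
        }
        __syncthreads();

        if (tid < 256) atomicAdd(&bins[tid], local[tid]);
    }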
MapCG: writing parallel program portable between CPU and GPU
In PACT ’10: Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques, 2010
"... Graphics Processing Units (GPU) have been playing an im-portant role in the general purpose computing market re-cently. The common approach to program GPU today is to write GPU specific code with low level GPU APIs such as CUDA. Although this approach can achieve very good per-formance, it raises se ..."
Abstract
-
Cited by 17 (1 self)
- Add to MetaCart
(Show Context)
Graphics Processing Units (GPUs) have been playing an important role in the general purpose computing market recently. The common approach to programming GPUs today is to write GPU-specific code with low-level GPU APIs such as CUDA. Although this approach can achieve very good performance, it raises serious portability issues: programmers are required to write a specific version of the code for each potential target architecture, which results in high development and maintenance cost. We believe it is desirable to have a programming model that provides source code portability between CPUs and GPUs, and across different GPUs: programmers only need to write one version of the code, which can be compiled and executed efficiently on either CPUs or GPUs without modification.
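As a purely hypothetical sketch of the write-once style the abstract argues for, the callbacks below express a word-count job as map and reduce functions that a framework could compile for either a CPU or a GPU back end; the names and signatures are invented for illustration and are not MapCG's actual API.

    struct WordCount {
        /* map: emit (word, 1) for every whitespace-separated token in a chunk */
        template <typename Emitter>
        static void map(const char *chunk, int len, Emitter &emit) {
            int start = -1;
            for (int i = 0; i <= len; ++i) {
                bool in_word = (i < len) && chunk[i] != ' ' && chunk[i] != '\n';
                if (in_word && start < 0) start = i;
                if (!in_word && start >= 0) { emit(chunk + start, i - start, 1); start = -1; }
            }
        }
        /* reduce: sum the counts collected for one key */
        static int reduce(const int *values, int count) {
            int sum = 0;
            for (int i = 0; i < count; ++i) sum += values[i];
            return sum;
        }
    };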
A framework for efficient and scalable execution of domain-specific templates on GPUs
"... Graphics Processing Units (GPUs) have emerged as important players in the transition of the computing industry from sequential to multi- and many-core computing. We propose a software framework for execution of domainspecific parallel templates on GPUs, which simultaneously raises the abstraction le ..."
Abstract
-
Cited by 13 (1 self)
- Add to MetaCart
(Show Context)
Graphics Processing Units (GPUs) have emerged as important players in the transition of the computing industry from sequential to multi- and many-core computing. We propose a software framework for execution of domain-specific parallel templates on GPUs, which simultaneously raises the abstraction level of GPU programming and ensures efficient execution with forward scalability to large data sizes and new GPU platforms. To achieve scalable and efficient GPU execution, our framework focuses on two critical problems that have been largely ignored in previous efforts: processing large data sets that do not fit within the GPU memory, and minimizing data transfers between the host and GPU. Our framework takes domain-specific parallel programming templates that are expressed as parallel operator graphs, and performs operator splitting, offload unit identification, and scheduling of offloaded computations and data transfers between the host and the GPU, to generate a highly optimized execution plan. Finally, a code generator produces a hybrid CPU/GPU program in accordance with the derived execution plan, using lower-level frameworks such as CUDA. We have applied the proposed framework to templates from the recognition domain, specifically edge detection kernels and convolutional neural networks that are commonly used in image and video analysis. We present results on two different GPU platforms from NVIDIA (a Tesla C870 GPU computing card and a GeForce 8800 graphics card) that demonstrate 1.7-7.8X performance improvements over already accelerated baseline GPU implementations. We also demonstrate scalability to input data sets and application memory footprints of 6 GB and 17 GB, respectively, on GPU platforms with only 768 MB and 1.5 GB of memory.
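A minimal sketch of the operator-splitting idea for data that does not fit in GPU memory is shown below: a large host array is processed in device-sized chunks on two CUDA streams, so the transfer of one chunk can overlap the computation on the other. The kernel is a stand-in for an offloaded operator, and real overlap additionally requires pinned host buffers; this illustrates the shape of such an execution plan, not the framework's actual generated code.

    __global__ void edge_filter(const float *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i];  /* stand-in for a real stencil/filter operator */
    }

    void run_chunked(const float *h_in, float *h_out, size_t total, size_t chunk)
    {
        float *d_in[2], *d_out[2];
        cudaStream_t s[2];
        for (int i = 0; i < 2; ++i) {
            cudaMalloc(&d_in[i], chunk * sizeof(float));
            cudaMalloc(&d_out[i], chunk * sizeof(float));
            cudaStreamCreate(&s[i]);
        }
        for (size_t off = 0, k = 0; off < total; off += chunk, ++k) {
            int b = (int)(k & 1);                      /* alternate between the two buffers/streams */
            size_t len = (off + chunk <= total) ? chunk : total - off;
            cudaMemcpyAsync(d_in[b], h_in + off, len * sizeof(float),
                            cudaMemcpyHostToDevice, s[b]);
            edge_filter<<<(int)((len + 255) / 256), 256, 0, s[b]>>>(d_in[b], d_out[b], (int)len);
            cudaMemcpyAsync(h_out + off, d_out[b], len * sizeof(float),
                            cudaMemcpyDeviceToHost, s[b]);
        }
        for (int i = 0; i < 2; ++i) cudaStreamSynchronize(s[i]);
    }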
Dynamically Managed Data for CPU-GPU Architectures
"... GPUs are flexible parallel processors capable of accelerating real applications. To exploit them, programmers must ensure a consistent program state between the CPU and GPU memories by managing data. Manually managing data is tedious and error-prone. In prior work on automatic CPU-GPU data managemen ..."
Abstract
-
Cited by 12 (1 self)
- Add to MetaCart
(Show Context)
GPUs are flexible parallel processors capable of accelerating real applications. To exploit them, programmers must ensure a consistent program state between the CPU and GPU memories by managing data. Manually managing data is tedious and error-prone. In prior work on automatic CPU-GPU data management, alias analysis quality limits performance, and type-inference quality limits applicability. This paper presents Dynamically Managed Data (DyManD), the first automatic system to manage complex and recursive data structures without static analyses. By replacing static analyses with a dynamic run-time system, DyManD overcomes the performance limitations of alias analysis and enables management of complex and recursive data structures. DyManD-enabled GPU parallelization matches the performance of prior work equipped with perfectly precise alias analysis for 27 programs and demonstrates improved applicability on programs not previously managed automatically.
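The fragment below is a hedged sketch of the run-time bookkeeping idea: each managed allocation is recorded in a table mapping host pointers to their device copies, so whether and when a pointer's data must be copied is decided by lookup at run time rather than by static alias analysis. It only illustrates the approach; DyManD's actual implementation is more sophisticated (for example, it follows pointers inside recursive structures).

    #include <cstdlib>
    #include <map>
    #include <cuda_runtime.h>

    static std::map<void *, void *>  g_dev_copy;  /* host pointer -> device copy     */
    static std::map<void *, size_t>  g_bytes;     /* host pointer -> allocation size */

    void *managed_alloc(size_t bytes) {
        void *h = std::malloc(bytes);
        void *d = nullptr;
        cudaMalloc(&d, bytes);
        g_dev_copy[h] = d;
        g_bytes[h] = bytes;
        return h;
    }

    /* called before launching a kernel that accesses the object behind h */
    void *to_device(void *h) {
        cudaMemcpy(g_dev_copy[h], h, g_bytes[h], cudaMemcpyHostToDevice);
        return g_dev_copy[h];
    }

    /* called before the CPU touches the object again after a kernel wrote it */
    void to_host(void *h) {
        cudaMemcpy(h, g_dev_copy[h], g_bytes[h], cudaMemcpyDeviceToHost);
    }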
Adaptive Input-aware Compilation for Graphics Engines
2012
"... Whileg raphics processing units (GPUs) provide low-cost and efficient platforms for accelerating high performance computations, thetediousprocessofperformancetuningrequired tooptimizeapplicationsisanobstacletowideradoptionofGPUs. In addition totheprogrammabilitychallengesposed by GPU’s complex memor ..."
Abstract
-
Cited by 12 (5 self)
- Add to MetaCart
While graphics processing units (GPUs) provide low-cost and efficient platforms for accelerating high performance computations, the tedious process of performance tuning required to optimize applications is an obstacle to wider adoption of GPUs. In addition to the programmability challenges posed by the GPU's complex memory hierarchy and parallelism model, a well-known application design problem is target portability across different GPUs. However, even for a single GPU target, changing a program's input characteristics can make an already-optimized implementation of a program perform poorly. In this work, we propose Adaptic, an adaptive input-aware compilation system to tackle this important, yet overlooked, input portability problem. Using this system, programmers develop their applications in a high-level streaming language and let Adaptic undertake the difficult task of input-portable optimizations and code generation. Several input-aware optimizations are introduced to make efficient use of the memory hierarchy and customize thread composition. At runtime, a properly optimized version of the application is executed based on the actual program input. We perform a head-to-head comparison between the Adaptic-generated and hand-optimized CUDA programs. The results show that Adaptic is capable of generating code that can perform on par with its hand-optimized counterparts over certain input ranges and outperform them when the input falls out of the hand-optimized programs' "comfort zone". Furthermore, we show that input-aware results are sustainable across different GPU targets, making it possible to write and optimize applications once and run them anywhere.
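The run-time half of that input-aware approach can be pictured as a dispatcher that picks among kernel variants specialized for different input regimes, as in the hypothetical sketch below (variant names, thresholds, and launch parameters are invented for illustration; Adaptic derives its versions from a streaming program automatically).

    __global__ void reduce_small(const float *in, float *out, int n)  /* one block, shared memory */
    {
        __shared__ float buf[256];
        int t = threadIdx.x;
        float v = 0.0f;
        for (int i = t; i < n; i += blockDim.x) v += in[i];
        buf[t] = v;
        __syncthreads();
        for (int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (t < s) buf[t] += buf[t + s];
            __syncthreads();
        }
        if (t == 0) *out = buf[0];
    }

    __global__ void reduce_large(const float *in, float *out, int n)  /* many blocks, one atomic each */
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        float v = 0.0f;
        for (int j = i; j < n; j += blockDim.x * gridDim.x) v += in[j];
        atomicAdd(out, v);
    }

    void reduce(const float *d_in, float *d_out, int n)
    {
        if (n <= 4096) {
            reduce_small<<<1, 256>>>(d_in, d_out, n);                   /* tiny inputs: avoid multi-block overhead */
        } else {
            cudaMemset(d_out, 0, sizeof(float));
            reduce_large<<<(n + 511) / 512, 256>>>(d_in, d_out, n);     /* enough blocks to fill the GPU */
        }
    }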
Static Compilation Analysis for Host-Accelerator Communication Optimization
"... Abstract. We present an automatic, static program transformation that schedules and generates e cient memory transfers between a computer host and its hardware accelerator, addressing a well-known performance bottleneck. Our automatic approach uses two simple heuristics: to perform transfers to the ..."
Abstract
-
Cited by 11 (2 self)
- Add to MetaCart
(Show Context)
Abstract. We present an automatic, static program transformation that schedules and generates efficient memory transfers between a computer host and its hardware accelerator, addressing a well-known performance bottleneck. Our automatic approach uses two simple heuristics: to perform transfers to the accelerator as early as possible and to delay transfers back from the accelerator as late as possible. We implemented this transformation as a middle-end compilation pass in the PIPS/Par4All compiler. In the generated code, redundant communications due to data reuse between kernel executions are avoided. Instructions that initiate transfers are scheduled effectively at compile time. We present experimental results obtained with Polybench 2.0, some Rodinia benchmarks, and a real numerical simulation. We obtain an average speedup of 4 to 5 when compared to a naïve parallelization using a modern GPU with Par4All, HMPP, and PGI, and 3.5 when compared to an OpenMP version using a 12-core multiprocessor.
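In CUDA terms, the effect of those two heuristics on generated code looks roughly like the sketch below: the copy to the accelerator is hoisted above a loop of kernel launches and the copy back is sunk below it, so data reused across kernel executions is not transferred repeatedly. The kernel is a trivial stand-in for an offloaded loop nest; this is an illustration of the scheduling idea, not the compiler's actual output.

    __global__ void relax(float *grid, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) grid[i] = 0.5f * grid[i];  /* stand-in for a real relaxation step */
    }

    void solve(float *h_grid, int n, int iters)
    {
        float *d_grid;
        cudaMalloc(&d_grid, n * sizeof(float));

        cudaMemcpy(d_grid, h_grid, n * sizeof(float), cudaMemcpyHostToDevice);   /* hoisted: as early as possible */
        for (int it = 0; it < iters; ++it)
            relax<<<(n + 255) / 256, 256>>>(d_grid, n);                          /* no per-iteration transfers    */
        cudaMemcpy(h_grid, d_grid, n * sizeof(float), cudaMemcpyDeviceToHost);   /* delayed: as late as possible  */

        cudaFree(d_grid);
    }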
Streamlining GPU Applications On the Fly: Thread Divergence Elimination through Runtime Thread-Data Remapping
"... Because of their tremendous computing power and remarkable cost efficiency, GPUs (graphic processing unit) have quickly emerged as a kind of influential platform for high performance computing. However, as GPUs are designed for massive data-parallel computing, their performance is subject to the pre ..."
Abstract
-
Cited by 11 (0 self)
- Add to MetaCart
(Show Context)
Because of their tremendous computing power and remarkable cost efficiency, GPUs (graphics processing units) have quickly emerged as an influential platform for high performance computing. However, as GPUs are designed for massive data-parallel computing, their performance is subject to the presence of condition statements in a GPU application. On a conditional branch where threads diverge in which path to take, the threads taking different paths have to run serially. Such divergences often cause serious performance degradation, impairing the adoption of GPUs for many applications that contain non-trivial branches or certain types of loops. This paper presents a systematic investigation into the use of runtime thread-data remapping for solving that problem. It introduces an abstract form of GPU applications, based on which it describes the use of reference redirection and data layout transformation for remapping data and threads to minimize thread divergences. It discusses the major challenges for practical deployment of the remapping techniques, most notably the conflict between the large remapping overhead and the need for the remapping to happen on the fly because thread divergences depend on runtime values. It offers a solution to this challenge by proposing a CPU-GPU pipelining scheme and a label-assign-move (LAM) algorithm to virtually hide all the remapping overhead. Finally, it reports significant performance improvements produced by the remapping for a set of GPU applications, demonstrating the potential of the techniques for streamlining GPU applications on the fly.
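The pair of kernels below sketches the reference-redirection idea in isolation (the kernels and the remapping policy are illustrative, not the paper's implementation): in the first version, threads of a warp may take different sides of the branch; in the second, an index array that lists all elements satisfying the predicate first redirects threads to data, so whole warps take a single path, except the one warp straddling the boundary. The index array itself would be computed on the CPU, which is where a pipelining scheme can hide the remapping cost.

    __global__ void process_divergent(const float *x, float *y, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            if (x[i] > 0.0f) y[i] = x[i] * x[i];  /* threads of one warp may take   */
            else             y[i] = 0.0f;         /* different paths at this branch */
        }
    }

    __global__ void process_remapped(const float *x, float *y, const int *remap, int n, int n_pos)
    {
        int t = blockIdx.x * blockDim.x + threadIdx.x;
        if (t < n) {
            int i = remap[t];                     /* redirection through the index array      */
            if (t < n_pos) y[i] = x[i] * x[i];    /* threads 0..n_pos-1 all take this path    */
            else           y[i] = 0.0f;           /* the remaining threads all take this path */
        }
    }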
Designing a Unified Programming Model for Heterogeneous Machines
2012
"... Abstract—While high-efficiency machines are increasingly embracing heterogeneous architectures and massive multithreading, contemporary mainstream programming languages reflect a mental model in which processing elements are homogeneous, concurrency is limited, and memory is a flat undifferentiated ..."
Abstract
-
Cited by 9 (2 self)
- Add to MetaCart
(Show Context)
Abstract—While high-efficiency machines are increasingly embracing heterogeneous architectures and massive multithreading, contemporary mainstream programming languages reflect a mental model in which processing elements are homogeneous, concurrency is limited, and memory is a flat undifferentiated pool of storage. Moreover, the current state of the art in programming heterogeneous machines tends towards using separate programming models, such as OpenMP and CUDA, for different portions of the machine. Both of these factors make programming emerging heterogeneous machines unnecessarily difficult. We describe the design of the Phalanx programming model, which seeks to provide a unified programming model for heterogeneous machines. It provides constructs for bulk parallelism, synchronization, and data placement which operate across the entire machine. Our prototype implementation is able to launch and coordinate work on both CPU and GPU processors within a single node, and by leveraging the GASNet runtime, is able to run across all the nodes of a distributed-memory machine.
The design and implementation of Ocelot’s dynamic binary translator from PTX to multi-core x86
2009
"... Abstract—Ocelot is a dynamic compilation framework designed to map the explicitly parallel PTX execution model used by NVIDIA CUDA applications onto diverse many-core architectures. Ocelot includes a dynamic binary translator from PTX to many-core processors that leverages the LLVM code generator to ..."
Abstract
-
Cited by 8 (1 self)
- Add to MetaCart
(Show Context)
Abstract—Ocelot is a dynamic compilation framework designed to map the explicitly parallel PTX execution model used by NVIDIA CUDA applications onto diverse many-core architectures. Ocelot includes a dynamic binary translator from PTX to many-core processors that leverages the LLVM code generator to target x86. The binary translator is able to execute CUDA applications without recompilation and Ocelot can in fact dynamically switch between execution on an NVIDIA GPU and a many-core CPU. It has been validated against over 100 applications taken from the CUDA SDK [1], the UIUC Parboil benchmarks [2], the Virginia Rodinia benchmarks [3], the GPU-VSIPL signal and image processing library [4], and several domain specific applications. This paper presents a detailed description of the implementation of our binary translator highlighting design decisions and trade-offs, and showcasing their effect on application performance. We explore several code transformations that are applicable only when translating explicitly parallel applications and suggest additional optimization passes that may be useful to this class of applications. We expect this study to inform the design of compilation tools for explicitly parallel programming models (such as OpenCL) as well as future CPU and GPU architectures.
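The core of executing an explicitly parallel kernel on a sequential core can be pictured as turning each thread block into loops over its threads, with a __syncthreads() barrier becoming the boundary between two such loops. The hand-written analogue below illustrates that idea for a trivial kernel; it is not Ocelot's generated code, which is produced at the PTX/LLVM level.

    /* CUDA original (conceptually):
       __global__ void scale(float *x, float s, int n) {
           int i = blockIdx.x * blockDim.x + threadIdx.x;
           if (i < n) x[i] *= s;
       } */

    void scale_on_cpu(float *x, float s, int n, int gridDim_x, int blockDim_x)
    {
        for (int blockIdx_x = 0; blockIdx_x < gridDim_x; ++blockIdx_x) {
            /* the "thread loop": one iteration per CUDA thread of the block */
            for (int threadIdx_x = 0; threadIdx_x < blockDim_x; ++threadIdx_x) {
                int i = blockIdx_x * blockDim_x + threadIdx_x;
                if (i < n) x[i] *= s;
            }
            /* a __syncthreads() in the kernel would end this loop and start a new one here */
        }
    }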