Results 1 - 10 of 127
Copperhead: Compiling an embedded data parallel language
- In Principles and Practices of Parallel Programming (PPoPP'11), 2011
Abstract - Cited by 62 (4 self)
Modern parallel microprocessors deliver high performance on applications that expose substantial fine-grained data parallelism. Although data parallelism is widely available in many computations, implementing data parallel algorithms in low-level languages is often an unnecessarily difficult task. The characteristics of parallel microprocessors and the limitations of current programming methodologies motivate our design of Copperhead, a high-level data parallel language embedded in Python. The Copperhead programmer describes parallel computations via composition of familiar data parallel primitives supporting both flat and nested data parallel computation on arrays of data. Copperhead programs are expressed in a subset of the widely used Python programming language and interoperate with standard Python modules, including libraries for numeric computation, data visualization, and analysis. In this paper, we discuss the language, compiler, and runtime features that enable Copperhead to efficiently execute data parallel code. We define the restricted subset of Python which Copperhead supports and introduce the program analysis techniques necessary for compiling Copperhead code into efficient low-level implementations. We also outline the runtime support by which Copperhead programs interoperate with standard Python modules. We demonstrate the effectiveness of our techniques with several examples targeting the CUDA platform for parallel programming on GPUs. Copperhead code is concise, on average requiring 3.6 times fewer lines of code than CUDA, and the compiler generates efficient code, yielding 45-100% of the performance of hand-crafted, well-optimized CUDA code.
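The "composition of familiar data parallel primitives" this abstract describes can be pictured in plain Python; the sketch below mimics the flat (axpy) and nested (dot) map/reduce style, but it is ordinary sequential Python for illustration, not the actual Copperhead API or runtime:

```python
from functools import reduce

def axpy(a, x, y):
    # Elementwise a*x + y expressed as a map over paired elements;
    # each element is independent, so a data-parallel compiler can
    # execute the whole map in parallel.
    return list(map(lambda xi, yi: a * xi + yi, x, y))

def dot(x, y):
    # Nested composition: a map feeding a reduction.
    return reduce(lambda s, t: s + t,
                  map(lambda xi, yi: xi * yi, x, y), 0.0)

print(axpy(2.0, [1.0, 2.0], [10.0, 20.0]))  # [12.0, 24.0]
print(dot([1.0, 2.0], [3.0, 4.0]))          # 11.0
```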
A GPGPU Compiler for Memory Optimization and Parallelism Management
- In Proceedings of PLDI, 2010
Abstract - Cited by 53 (4 self)
This paper presents a novel optimizing compiler for general purpose computation on graphics processing units (GPGPU). It addresses two major challenges of developing high performance GPGPU programs: effective utilization of the GPU memory hierarchy and judicious management of parallelism. The input to our compiler is a naïve GPU kernel function, which is functionally correct but written without any consideration for performance. The compiler analyzes the code, identifies its memory access patterns, and generates both the optimized kernel and the kernel invocation parameters. Our optimization process includes vectorization and memory coalescing for memory bandwidth enhancement, tiling and unrolling for data reuse and parallelism management, and thread block remapping or address-offset insertion for partition-camping elimination. Experiments on a set of scientific and media processing algorithms show that our optimized code achieves very high performance, either superior or very close to the highly fine-tuned NVIDIA CUBLAS 2.2 library, with speedups of up to 128x over the naive versions. Another distinguishing feature of our compiler is the understandability of the optimized code, which is useful for performance analysis and algorithm refinement.
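Of the transformations listed, tiling is the easiest to sketch outside CUDA: a loop nest is restructured into small blocks so each block's working set fits in fast memory (shared memory on a GPU). The toy transpose below is a generic illustration of that loop shape, not code generated by the paper's compiler:

```python
def transpose_tiled(a, n, tile=2):
    # Process the n x n matrix in tile x tile blocks; within a block,
    # all reads and writes touch a small region that could be staged
    # in GPU shared memory for reuse and coalesced access.
    out = [[0] * n for _ in range(n)]
    for ii in range(0, n, tile):
        for jj in range(0, n, tile):
            for i in range(ii, min(ii + tile, n)):
                for j in range(jj, min(jj + tile, n)):
                    out[j][i] = a[i][j]
    return out

print(transpose_tiled([[1, 2], [3, 4]], 2))  # [[1, 3], [2, 4]]
```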
Accelerating SQL Database Operations on a GPU with CUDA
Abstract - Cited by 48 (1 self)
Prior work has shown dramatic acceleration for various database operations on GPUs, but only using primitives that are not part of conventional database languages such as SQL. This paper implements a subset of the SQLite command processor directly on the GPU. This dramatically reduces the effort required to achieve GPU acceleration by avoiding the need for database programmers to use new programming languages such as CUDA or modify their programs to use non-SQL libraries. This paper focuses on accelerating SELECT queries and describes the considerations in an efficient GPU implementation of the SQLite command processor. Results on an NVIDIA Tesla C1060 achieve speedups of 20-70X depending on the size of the result set.
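The reason SELECT queries map well to a GPU is that the WHERE predicate is evaluated independently for every row, one thread per row. The sketch below shows that row-parallel shape over columnar data in plain Python; the table, columns, and predicate are made up for illustration and are not the paper's SQLite implementation:

```python
# Hypothetical columnar table: two parallel arrays, one row per index.
ids    = [1, 2, 3, 4, 5]
prices = [9.5, 12.0, 3.25, 40.0, 12.0]

def select_where(pred):
    # Each row's predicate evaluation is independent of the others,
    # so on a GPU every row could be tested by its own thread; here
    # we simply iterate sequentially.
    return [i for i, p in zip(ids, prices) if pred(p)]

print(select_where(lambda p: p >= 12.0))  # [2, 4, 5]
```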
An Extension of the StarSs Programming Model for Platforms with Multiple GPUs
- In Proc. of Euro-Par, 2009
Abstract - Cited by 48 (3 self)
While general-purpose homogeneous multi-core architectures are becoming ubiquitous, there are clear indications that, for a number of important applications, a better performance/power ratio can be attained using specialized hardware accelerators. These accelerators require specific SDKs or programming languages which are not always easy to program. Thus, the impact of the new programming paradigms on the programmer's productivity will determine their success in the high-performance computing arena. In this paper we present GPU Superscalar (GPUSs), an extension of the Star Superscalar programming model that targets the parallelization of applications on platforms consisting of a general-purpose processor connected with multiple graphics processors. GPUSs deals with architecture heterogeneity and separate memory address spaces, while preserving simplicity and portability. Preliminary experimental results for a well-known operation in numerical linear algebra illustrate the correct adaptation of the runtime to a multi-GPU system, attaining notable performance results. Keywords: task-level parallelism, graphics processors, heterogeneous systems, programming models
On-the-Fly Elimination of Dynamic Irregularities for GPU Computing
Abstract - Cited by 42 (9 self)
The power-efficient, massively parallel Graphics Processing Units (GPUs) have become increasingly influential for scientific computing over the past few years. However, their efficiency is sensitive to dynamic irregular memory references and control flows in an application. Experiments have shown great performance gains when these irregularities are removed, but it remains an open question how to achieve those gains through software approaches on real GPUs. In this paper, we present a systematic exploration to tackle dynamic irregularities in both control flows and memory references. We report findings on their inherent properties, including interactions among different types of irregularities, their relations with program data and threads, the computational complexities in removing them, and heuristics-based algorithms for their removal through data reordering, job swapping, and hybrid transformations. Based on these findings, we develop a framework, named G-Streamline, as a unified software solution to dynamic irregularities in GPU computing. G-Streamline has several distinctive properties. It is a pure software solution and works on the fly, requiring no hardware extensions or offline profiling. It treats both types of irregularities at the same time in a holistic fashion, maximizing whole-program performance by resolving conflicts among optimizations. Its optimization overhead is largely transparent to GPU kernel executions, without jeopardizing the basic efficiency of the GPU application. Finally, it is robust to the presence of various complexities in GPU applications. Our experimental results demonstrate that G-Streamline is effective in reducing both types of dynamic irregularities and is capable of producing significant performance improvements for a variety of applications.
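Of the transformations the abstract names (data reordering, job swapping, hybrid), data reordering is simplest to sketch: an irregular gather `A[idx[i]]` is made regular by permuting the data once so that consecutive threads read consecutive elements. The helper below is a hypothetical illustration of that idea, not G-Streamline's actual algorithm:

```python
def reorder_for_gather(data, idx):
    # Pay a one-time reordering cost so that the per-iteration gather
    # becomes the identity mapping: thread i reads remapped[i], a
    # contiguous (coalescable) access pattern instead of a scattered one.
    remapped = [data[j] for j in idx]
    new_idx = list(range(len(idx)))
    return remapped, new_idx

data = [10, 20, 30, 40]
idx  = [3, 1, 0, 2]                 # irregular memory references
remapped, new_idx = reorder_for_gather(data, idx)
# The reordered gather yields the same values as the original one.
assert [remapped[i] for i in new_idx] == [data[j] for j in idx]
print(remapped)  # [40, 20, 10, 30]
```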
Sponge: Portable stream programming on graphics engines
- In ASPLOS'11, 2011
Abstract - Cited by 36 (7 self)
Graphics processing units (GPUs) provide a low-cost platform for accelerating high performance computations. The introduction of new programming languages, such as CUDA and OpenCL, makes GPU programming attractive to a wide variety of programmers. However, programming GPUs is still a cumbersome task for two primary reasons: tedious performance optimizations and lack of portability. First, optimizing an algorithm for a specific GPU is a time-consuming task that requires a thorough understanding of both the algorithm and the underlying hardware. Unoptimized CUDA programs typically achieve only a small fraction of the peak GPU performance. Second, GPU code lacks efficient portability, as code written for one GPU can be inefficient when executed on another. Moving code from one GPU to another while maintaining the desired performance is a non-trivial task, often requiring significant modifications to account for the hardware differences. In this work, we propose Sponge, a compilation framework for GPUs using synchronous data flow streaming languages. Sponge is capable of performing a wide variety of optimizations to generate efficient code for graphics engines. Sponge alleviates the problems associated with current GPU programming methods by providing portability across different generations of GPUs and CPUs, and a better abstraction of the hardware details, such as the memory hierarchy and threading model. Using streaming, we provide a write-once software paradigm and rely on the compiler to automatically create optimized CUDA code for a wide variety of GPU targets. Sponge's compiler optimizations improve the performance of the baseline CUDA implementations by an average of 3.2x.
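What makes synchronous data flow attractive for a compiler like Sponge is that each filter consumes and produces a fixed number of items per firing, so schedules and buffer sizes can be computed statically. The two-stage pipeline below is a minimal plain-Python sketch of that property, not StreamIt or Sponge syntax:

```python
def scale(stream, k):
    # Filter with rate 1 -> 1: one item in, one item out per firing.
    for x in stream:
        yield k * x

def pairwise_sum(stream):
    # Filter with rate 2 -> 1: consumes two items, produces one.
    it = iter(stream)
    for a in it:
        b = next(it)
        yield a + b

# Because the rates are fixed, a compiler knows scale must fire twice
# per pairwise_sum firing and can size the FIFO between them statically.
out = list(pairwise_sum(scale(range(8), 10)))
print(out)  # [10, 50, 90, 130]
```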
JCUDA: A programmer-friendly interface for accelerating Java programs with CUDA
- In Proc. of the 15th Euro-Par, 2009
Abstract - Cited by 34 (2 self)
A recent trend in mainstream desktop systems is the use of general-purpose graphics processing units (GPGPUs) to obtain order-of-magnitude performance improvements. CUDA has emerged as a popular programming model for GPGPUs for use by C/C++ programmers. Given the widespread use of modern object-oriented languages with managed runtimes like Java and C#, it is natural to explore how CUDA-like capabilities can be made accessible to those programmers as well. In this paper, we present a programming interface called JCUDA that can be used by Java programmers to invoke CUDA kernels. Using this interface, programmers can write Java code that directly calls CUDA kernels, delegating the responsibility of generating the Java-CUDA bridge code and host-device data transfer calls to the compiler. Our preliminary performance results show that this interface can deliver significant performance improvements to Java programmers. For future work, we plan to use the JCUDA interface as a target language for supporting higher-level parallel programming languages like X10 and Habanero-Java.
Cache-conscious wavefront scheduling
- In Proceedings of the 45th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO '12), 2012
Abstract - Cited by 32 (2 self)
This paper studies the effects of hardware thread scheduling on cache management in GPUs. We propose Cache-Conscious Wavefront Scheduling (CCWS), an adaptive hardware mechanism that makes use of a novel intra-wavefront locality detector to capture locality that is lost by other schedulers due to excessive contention for cache capacity. In contrast to improvements in the replacement policy that can better tolerate difficult access patterns, CCWS shapes the access pattern to avoid thrashing the shared L1. We show that CCWS can outperform any replacement scheme by evaluating against the Belady-optimal policy. Our evaluation demonstrates that cache efficiency and preservation of intra-wavefront locality become more important as GPU computing expands beyond use in high performance computing. At an estimated cost of 0.17% total chip area, CCWS reduces the number of threads actively issued on a core when appropriate. This leads to an average of 25% fewer L1 data cache misses, which results in a harmonic mean 24% performance improvement over previously proposed scheduling policies across a diverse selection of cache-sensitive workloads.
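The intuition behind throttling active wavefronts can be shown with a toy LRU cache model: two wavefronts whose combined working set exceeds the cache thrash it when interleaved, while running them back-to-back preserves each wavefront's locality. The sizes and traces below are invented for illustration and do not model CCWS itself:

```python
def misses(trace, capacity=4):
    # Count misses of an LRU cache over a sequence of line addresses.
    cache, count = [], 0
    for line in trace:
        if line in cache:
            cache.remove(line)       # hit: refresh LRU position
        else:
            count += 1               # miss
            if len(cache) == capacity:
                cache.pop(0)         # evict least recently used
        cache.append(line)
    return count

wf_a = ["a0", "a1", "a2"] * 3        # each wavefront re-reads 3 lines
wf_b = ["b0", "b1", "b2"] * 3
interleaved = [x for pair in zip(wf_a, wf_b) for x in pair]

print(misses(interleaved))  # 18: combined working set of 6 > 4, all miss
print(misses(wf_a + wf_b))  # 6: only the compulsory misses remain
```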
Mint: Realizing CUDA Performance in 3D Stencil Methods with Annotated C
- In Proceedings of the 25th International Conference on Supercomputing (ICS'11), 2011
Abstract - Cited by 23 (1 self)
We present Mint, a programming model that enables the non-expert to enjoy the performance benefits of hand-coded CUDA without becoming entangled in the details. Mint targets stencil methods, which are an important class of scientific applications. We have implemented the Mint programming model with a source-to-source translator that generates optimized CUDA C from traditional C source. The translator relies on annotations to guide translation at a high level. The set of pragmas is small, and the model is compact and simple. Yet, Mint is able to deliver performance competitive with painstakingly hand-optimized CUDA. We show that, for a set of widely used stencil kernels, Mint realized 80% of the performance obtained from aggressively optimized CUDA on the NVIDIA 200-series GPUs. Our optimizations target three-dimensional kernels, which present a daunting array of optimizations.
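The stencil computations Mint targets have a regular neighbor-access structure. The one-dimensional three-point sweep below (plain Python, not Mint's annotated C, and with made-up weights) shows the loop shape such a translator tiles and stages in GPU shared memory:

```python
def stencil_step(u):
    # Three-point stencil: each interior point becomes a weighted sum
    # of itself and its two neighbors; boundary values are kept fixed.
    # The fixed neighborhood is what lets a translator tile the loop
    # and stage each tile's halo in fast on-chip memory.
    v = u[:]
    for i in range(1, len(u) - 1):
        v[i] = 0.25 * u[i - 1] + 0.5 * u[i] + 0.25 * u[i + 1]
    return v

u = [0.0, 0.0, 4.0, 0.0, 0.0]
print(stencil_step(u))  # [0.0, 1.0, 2.0, 1.0, 0.0]
```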
Compiler and runtime support for enabling generalized reduction computations on heterogeneous parallel configurations
- In ICS, 2010
Abstract - Cited by 21 (1 self)
A trend that has materialized, and has given rise to much attention, is that of increasingly heterogeneous computing platforms. Presently, it has become very common for a desktop or a notebook computer to come equipped with both a multi-core CPU and a GPU. Capitalizing on the maximum computational power of such architectures (i.e., by simultaneously exploiting both the multi-core CPU and the GPU) starting from a high-level API is a critical challenge. We believe that it would be highly desirable to support a simple way for programmers to realize the full potential of today's heterogeneous machines. This paper describes a compiler and runtime framework that can map a class of applications, namely those characterized by generalized reductions, to a system with a multi-core CPU and GPU. Starting with simple C functions with added annotations, we automatically generate the middleware API code for the multi-core, as well as CUDA code to exploit the GPU simultaneously. The runtime system provides efficient schemes for dynamically partitioning the work between CPU cores and the GPU. Our experimental results from two applications, k-means clustering and Principal Component Analysis (PCA), show that, through effectively harnessing the heterogeneous architecture, we can achieve significantly higher performance compared to using only the GPU or the multi-core CPU. In k-means, the heterogeneous version with 8 CPU cores and a GPU achieved a speedup of about 32.09x relative to the 1-thread CPU version. When compared to the faster of the CPU-only and GPU-only executions, we were able to achieve a performance gain of about 60%. In PCA, the heterogeneous version attained a speedup of 10.4x relative to the 1-thread CPU version. When compared to the faster of the CPU-only and GPU-only versions, we achieved a performance gain of about 63.8%.
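The generalized-reduction pattern and the CPU/GPU work split can be sketched as two workers processing disjoint chunks of the input and merging partial results. The "devices", the fixed share, and the sum-of-squares reduction below are stand-ins for illustration, not the paper's runtime or its dynamic partitioning scheme:

```python
def partial_reduce(chunk):
    # A generalized reduction: order-independent accumulation,
    # here a sum of squares.
    acc = 0
    for x in chunk:
        acc += x * x
    return acc

def heterogeneous_reduce(data, gpu_share=0.75):
    # Split the input between two workers standing in for the GPU
    # and the CPU cores; a faster device gets a larger share. The
    # partial results combine with the same operator.
    split = int(len(data) * gpu_share)
    gpu_part, cpu_part = data[:split], data[split:]
    return partial_reduce(gpu_part) + partial_reduce(cpu_part)

data = list(range(10))
assert heterogeneous_reduce(data) == sum(x * x for x in data)
print(heterogeneous_reduce(data))  # 285
```

The split is correct for any `gpu_share` precisely because the reduction is associative and commutative, which is the property that lets the runtime repartition work dynamically.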