Rodinia: A Benchmark Suite for Heterogeneous Computing
Cited by 27 (4 self)
This paper presents and characterizes Rodinia, a benchmark suite for heterogeneous computing. To help architects study emerging platforms such as GPUs (Graphics Processing Units), Rodinia includes applications and kernels which target multi-core CPU and GPU platforms. The choice of applications is inspired by Berkeley’s dwarf taxonomy. Our characterization shows that the Rodinia benchmarks cover a wide range of parallel communication patterns, synchronization techniques and power consumption, and has led to some important architectural insight, such as the growing importance of memory-bandwidth limitations and the consequent importance of data layout.
A Characterization of the Rodinia Benchmark Suite with Comparison to Contemporary CMP Workloads
Cited by 2 (0 self)
The recently released Rodinia benchmark suite enables users to evaluate heterogeneous systems including both accelerators, such as GPUs, and multicore CPUs. As Rodinia sees higher levels of acceptance, it becomes important that researchers understand this new set of benchmarks, especially in how they differ from previous work. In this paper, we present recent extensions to Rodinia and conduct a detailed characterization of the Rodinia benchmarks (including performance results on an NVIDIA GeForce GTX480, the first product released based on the Fermi architecture). We also compare and contrast Rodinia with Parsec to gain insights into the similarities and differences of the two benchmark collections; we apply principal component analysis to analyze the application space coverage of the two suites. Our analysis shows that many of the workloads in Rodinia and Parsec are complementary, capturing different aspects of certain performance metrics.
Memory-level and Thread-level Parallelism Aware GPU Architecture Performance Analytical Model
GPU architectures are increasingly important in the multi-core era due to their high number of parallel processors. Programming thousands of massively parallel threads is a big challenge for software engineers, but understanding the performance bottlenecks of those parallel programs on GPU architectures to improve application performance is even more difficult. Current approaches rely on programmers to tune their applications by exploiting the design space exhaustively without fully understanding the performance characteristics of their applications. To provide insights into the performance bottlenecks of parallel applications on GPU architectures, we propose a simple analytical model that estimates the execution time of massively parallel programs. The key component of our model is estimating the number of parallel memory requests (we call this the memory warp parallelism) by considering the number of running threads and memory bandwidth. Based on the degree of memory warp parallelism, the model estimates the cost of memory requests, thereby estimating the overall execution time of a program. Comparisons between the outcome of the model and the actual execution time in several GPUs show that the geometric mean of absolute error of our model on micro-benchmarks is 5.4% and on GPU computing applications is 13.3%. All the applications are written in the CUDA programming language.
BREAKING THE MEMORY WALL FOR HIGHLY MULTI-THREADED CORES
2010
Emerging applications such as scientific computation, media processing, machine learning and data mining are commonly computation- and data-intensive [1], and they usually exhibit abundant parallelism. These applications motivate the design of throughput-oriented many- and multi-core architectures that employ many small and simple cores and scale up to large thread counts. The cores themselves are also typically multi-threaded. Such organizations are sometimes referred to as Chip Multi-Threading (CMT). However, with tens or hundreds of concurrent threads running on a single chip, throughput is not limited by the computation resources, but by the overhead in data movement. In fact, adding more cores or threads is likely to harm performance due to contention in the memory system. It is therefore important to improve data management to either reduce or tolerate data movement and associated latencies. Several techniques have been proposed for conventional multi-core organizations. However, the large thread count per core in CMT poses new challenges: much lower cache capacity per thread, nonuniform overhead in thread communication, inability of aggressive out-of-order execution to hide latency, and additional memory latency caused by SIMD constraints. To address the above challenges, we propose several techniques. Some reduce contention in either private caches or shared caches; some identify the right amount of computation to replicate to reduce communication, and some reconfigure SIMD architectures at runtime. Our objective is to allow CMTs to achieve scalable performance along with the thread count, despite limited energy budget for their memory system.
Dynamic Heterogeneous Scheduling Decisions Using Historical Runtime Data
Heterogeneous systems often employ processing units with a wide spectrum of performance capabilities. Allowing individual applications to make greedy local scheduling decisions leads to imbalance, with underutilization of some devices and excessive contention for others. If we instead allow the system to make global scheduling decisions and assign some applications to a slower device, we can both increase overall system throughput and decrease individual application runtimes. We present a method for dynamically scheduling applications running on heterogeneous platforms in order to maximize overall throughput. The key to our approach is accurately estimating when an application would finish execution on a given device based on historical runtime information, allowing us to make scheduling decisions that are both globally and locally efficient. We evaluate our approach with a set of OpenCL applications running on a system with a multicore CPU and a discrete GPU. We show that scheduling decisions based on historical data can decrease the total runtime by 39% over GPU-only scheduling and 29% over scheduling that places each application on its preferred device.
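The idea of choosing a device by estimated finish time can be illustrated with a minimal sketch. This is not the paper's scheduler: the names `estimate_runtime`, `schedule`, and the `history` table are hypothetical, the estimate is assumed to be a plain mean of past runtimes, and the sample runtimes are invented for illustration.

```python
from statistics import mean

def estimate_runtime(history, app, device):
    """Mean of past runtimes of `app` on `device`, or None if never observed."""
    samples = history.get((app, device), [])
    return mean(samples) if samples else None

def schedule(apps, devices, history):
    """Place each app (unique names, arrival order) on the device where it
    is estimated to finish earliest, given the work already queued there."""
    busy_until = {d: 0.0 for d in devices}
    placement = {}
    for app in apps:
        best_device, best_finish = None, None
        for d in devices:
            est = estimate_runtime(history, app, d)
            if est is None:
                continue  # no history for this pairing, skip the device
            finish = busy_until[d] + est
            if best_finish is None or finish < best_finish:
                best_device, best_finish = d, finish
        if best_device is None:
            raise ValueError(f"no runtime history for {app!r}")
        busy_until[best_device] = best_finish
        placement[app] = best_device
    return placement, max(busy_until.values())

# Invented runtimes in seconds: every app is fastest on the GPU.
history = {
    ("a", "gpu"): [2.0], ("a", "cpu"): [6.0],
    ("b", "gpu"): [2.0], ("b", "cpu"): [3.0],
    ("c", "gpu"): [2.0], ("c", "cpu"): [10.0],
}
placement, makespan = schedule(["a", "b", "c"], ["gpu", "cpu"], history)
```

In this toy input, application `b` is sent to the CPU even though the GPU is its faster device, because the GPU is already occupied. The total runtime is 4.0 seconds, compared with 6.0 seconds if all three ran back to back on the GPU.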
Hierarchical Overlapped Tiling
This paper introduces hierarchical overlapped tiling, a transformation that applies loop tiling and fusion to conventional loops. Overlapped tiling is a useful transformation to reduce communication overhead, but it may also generate a significant amount of redundant computation. Hierarchical overlapped tiling performs overlapped tiling hierarchically to balance communication overhead and redundant computation, and thus has the potential to provide better performance. In this paper, we describe the hierarchical overlapped tiling optimization and its implementation in an OpenCL compiler. We also evaluate the effectiveness of this optimization using 8 programs that implement different forms of stencil computation. Our results show that hierarchical overlapped tiling achieves an average 37% speedup over traditional tiling on a 32-core workstation.
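The trade-off between communication and redundant computation is easiest to see in plain, single-level overlapped tiling, sketched below for a 1D three-point stencil. This is an illustrative sketch only, not the paper's hierarchical transformation or its OpenCL implementation; the function names and the averaging stencil are assumptions.

```python
def step(u):
    """One sweep of a 3-point averaging stencil; the two end cells are held fixed."""
    out = list(u)
    for i in range(1, len(u) - 1):
        out[i] = (u[i - 1] + u[i] + u[i + 1]) / 3.0
    return out

def naive(u, steps):
    """Reference: sweep the whole array `steps` times."""
    for _ in range(steps):
        u = step(u)
    return u

def overlapped_tiled(u, steps, tile):
    """Each tile loads a halo of `steps` cells per side, runs all time steps
    locally with no exchange between tiles, and writes back only its own cells.
    Halo cells are recomputed by neighbouring tiles: that is the redundant work."""
    n = len(u)
    result = [0.0] * n
    for lo in range(0, n, tile):
        hi = min(lo + tile, n)
        start = max(0, lo - steps)   # halo clipped at the domain boundary
        end = min(n, hi + steps)
        local = u[start:end]
        for _ in range(steps):
            local = step(local)
        # Cells closer than `steps` to an artificial edge are stale; the
        # tile's own cells [lo, hi) are far enough from it to be exact.
        result[lo:hi] = local[lo - start:hi - start]
    return result

u0 = [float(i * i % 7) for i in range(20)]
tiled = overlapped_tiled(u0, steps=3, tile=4)
reference = naive(u0, steps=3)
```

With a tile of 4 cells and 3 fused time steps, each interior tile works on 10 cells to produce 4 results. A hierarchical scheme, as the abstract describes, aims to rebalance this cost against communication across levels.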
Adaptive Input-aware Compilation for Graphics Engines
While graphics processing units (GPUs) provide low-cost and efficient platforms for accelerating high performance computations, the tedious process of performance tuning required to optimize applications is an obstacle to wider adoption of GPUs. In addition to the programmability challenges posed by GPU’s complex memory hierarchy and parallelism model, a well-known application design problem is target portability across different GPUs. However, even for a single GPU target, changing a program’s input characteristics can make an already-optimized implementation of a program perform poorly. In this work, we propose Adaptic, an adaptive input-aware compilation system to tackle this important, yet overlooked, input portability problem. Using this system, programmers develop their applications in a high-level streaming language and let Adaptic undertake the difficult task of input portable optimizations and code generation. Several input-aware optimizations are introduced to make efficient use of the memory hierarchy and customize thread composition. At runtime, a properly optimized version of the application is executed based on the actual program input. We perform a head-to-head comparison between the Adaptic generated and hand-optimized CUDA programs. The results show that Adaptic is capable of generating codes that can perform on par with their hand-optimized counterparts over certain input ranges and outperform them when the input falls out of the hand-optimized programs’ “comfort zone”. Furthermore, we show that input-aware results are sustainable across different GPU targets making it possible to write and optimize applications once and run them anywhere. Categories and Subject Descriptors: D.3.4 [Programming Languages]: Processors - Compilers
A Performance Analysis Framework for Identifying Potential Benefits in GPGPU Applications
Tuning code for GPGPU and other emerging many-core platforms is a challenge because few models or tools can precisely pinpoint the root cause of performance bottlenecks. In this paper, we present a performance analysis framework that can help shed light on such bottlenecks for GPGPU applications. Although a handful of GPGPU profiling tools exist, most of the traditional tools, unfortunately, simply provide programmers with a variety of measurements and metrics obtained by running applications, and it is often difficult to map these metrics to understand the root causes of slowdowns, much less decide what next optimization step to take to alleviate the bottleneck. In our approach, we first develop an analytical performance model that can precisely predict performance and aims to provide programmer-interpretable metrics. Then, we apply static and dynamic profiling to instantiate our performance model for a particular input code and show how the model can predict the potential performance benefits. We demonstrate our framework on a suite of micro-benchmarks as well as a variety of computations extracted from real codes.
Auto-Generation and Auto-Tuning of 3D Stencil Codes on GPU Clusters
This paper develops and evaluates search and optimization techniques for auto-tuning 3D stencil (nearest-neighbor) computations on GPUs. Observations indicate that parameter tuning is necessary for heterogeneous GPUs to achieve optimal performance with respect to a search space. Our proposed framework takes a most concise specification of stencil behavior from the user as a single formula, auto-generates tunable code from it, systematically searches for the best configuration and generates the code with optimal parameter configurations for different GPUs. This auto-tuning approach guarantees adaptive performance for different generations of GPUs while greatly enhancing programmer productivity. Experimental results show that the delivered floating point performance is very close to previous handcrafted work and outperforms other auto-tuned stencil codes by a large margin.
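A systematic search over a parameter space can be sketched as an exhaustive sweep, shown below. The abstract does not say which search strategy the framework uses, so exhaustive enumeration is an assumption here. The parameter names `bx` and `by` and the synthetic `toy_cost` function are hypothetical stand-ins for thread-block dimensions and a measured kernel time.

```python
from itertools import product

def autotune(param_space, cost):
    """Evaluate `cost` on every combination in `param_space` and return
    the cheapest configuration together with its cost."""
    names = list(param_space)
    best_cfg, best_cost = None, float("inf")
    for values in product(*(param_space[name] for name in names)):
        cfg = dict(zip(names, values))
        c = cost(cfg)
        if c < best_cost:
            best_cfg, best_cost = cfg, c
    return best_cfg, best_cost

def toy_cost(cfg):
    """Synthetic cost (not a real measurement): favours 256 threads per
    block with a block width of 32."""
    return abs(cfg["bx"] * cfg["by"] - 256) + abs(cfg["bx"] - 32)

space = {"bx": [8, 16, 32, 64], "by": [2, 4, 8, 16]}
best_cfg, best_cost = autotune(space, toy_cost)
```

In a real tuner, `cost` would compile and time the generated kernel on the target GPU, so the same search can select different configurations on different devices.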

