OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance (2013)

by Adwait Jog
Venue: ACM SIGARCH
Results 1 - 10 of 18 citing documents

A Locality-Aware Memory Hierarchy for Energy-Efficient GPU Architectures

by Minsoo Rhu, Michael Sullivan, Jingwen Leng, Mattan Erez
"... As GPU’s compute capabilities grow, their memory hierarchy increasingly becomes a bottleneck. Current GPU memory hierarchies use coarse-grained memory accesses to exploit spatial locality, maximize peak bandwidth, simplify control, and reduce cache meta-data storage. These coarse-grained memory acce ..."
Abstract - Cited by 6 (1 self)
As GPUs' compute capabilities grow, their memory hierarchy increasingly becomes a bottleneck. Current GPU memory hierarchies use coarse-grained memory accesses to exploit spatial locality, maximize peak bandwidth, simplify control, and reduce cache meta-data storage. These coarse-grained memory accesses, however, are a poor match for emerging GPU applications with irregular control flow and memory access patterns. Meanwhile, the massive multi-threading of GPUs and the simplicity of their cache hierarchies make CPU-specific memory system enhancements ineffective for improving the performance of irregular GPU applications. We design and evaluate a locality-aware memory hierarchy for throughput processors, such as GPUs. Our proposed design retains the advantages of coarse-grained accesses for spatially and temporally local programs while permitting selective fine-grained access to memory. By adaptively adjusting the access granularity, memory bandwidth and energy are reduced for data with low spatial/temporal locality without wasting control overheads or prefetching potential for data with high spatial locality. As such, our locality-aware memory hierarchy improves GPU performance, energy-efficiency, and memory throughput for a large range of applications.
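The adaptive-granularity idea described above can be made concrete with a small policy model. The Python sketch below is purely illustrative and is not the mechanism from the paper: the sector size, the saturating-counter threshold, and the SpatialLocalityPredictor name are all assumptions. It tracks, per memory region, how many sectors of evicted lines were actually touched and uses that history to decide whether the next miss fetches a full line or a single sector.

```python
# Minimal sketch of locality-adaptive access granularity (illustrative only).
# Assumed parameters: 128 B lines split into four 32 B sectors, a per-region
# saturating counter, and a fixed threshold; the paper's hardware policy differs.

LINE_BYTES = 128
SECTOR_BYTES = 32
SECTORS_PER_LINE = LINE_BYTES // SECTOR_BYTES

class SpatialLocalityPredictor:
    def __init__(self, threshold=3, max_count=7):
        self.counters = {}          # region id -> saturating counter
        self.threshold = threshold
        self.max_count = max_count

    def record_eviction(self, region, sectors_touched):
        """On eviction, remember whether most sectors of the line were used."""
        c = self.counters.get(region, self.max_count // 2)
        if sectors_touched >= SECTORS_PER_LINE // 2:
            c = min(self.max_count, c + 1)   # good spatial locality
        else:
            c = max(0, c - 1)                # poor spatial locality
        self.counters[region] = c

    def granularity_for_miss(self, region):
        """Return how many bytes to fetch for a miss in this region."""
        c = self.counters.get(region, self.max_count // 2)
        return LINE_BYTES if c >= self.threshold else SECTOR_BYTES

# Example: after several sparsely used lines in region 7, misses switch to 32 B fetches.
pred = SpatialLocalityPredictor()
for _ in range(4):
    pred.record_eviction(region=7, sectors_touched=1)
print(pred.granularity_for_miss(7))   # -> 32
```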

Citation Context

...inefficient utilization of off-chip bandwidth and compute resources [1, 2, 3]. Recent proposals have primarily focused on overcoming irregularity by improving device utilization and latency tolerance [4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15], but the memory bandwidth bottleneck still remains as a significant issue in future throughput computing [16]. Coarse-grained memory accesses waste off-chip bandwidth and limit the energy-efficiency ...

D2MA: Accelerating Coarse-Grained Data Transfer for GPUs

by D. Anoushe Jamshidi, Mehrzad Samadi, Scott Mahlke
"... To achieve high performance on many-core architectures like GPUs, it is crucial to efficiently utilize the available mem-ory bandwidth. Currently, it is common to use fast, on-chip scratchpad memories, like the shared memory available on GPUs ’ shader cores, to buffer data for computation. This buff ..."
Abstract - Cited by 3 (0 self)
To achieve high performance on many-core architectures like GPUs, it is crucial to efficiently utilize the available memory bandwidth. Currently, it is common to use fast, on-chip scratchpad memories, like the shared memory available on GPUs' shader cores, to buffer data for computation. This buffering, however, has some sources of inefficiency that hinder it from most efficiently utilizing the available memory resources. These issues stem from shader resources being used for repeated, regular address calculations, a need to shuffle data multiple times within a physically unified on-chip memory, and forcing all threads to synchronize to ensure RAW consistency based on the speed of the slowest threads. To address these inefficiencies, we propose Data-Parallel DMA, or D2MA. D2MA is a reimagination of traditional DMA that addresses the challenges of extending DMA to thousands of concurrently executing threads. D2MA decouples address generation from the shader's computational resources, provides a more direct and efficient path for data in global memory to travel into the shared memory, and introduces a novel dynamic synchronization scheme that is transparent to the programmer. These advancements allow D2MA to achieve speedups as high as 2.29x and to reduce the time needed to buffer data by 81% on average.
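As a rough illustration of the address-generation decoupling described above, the sketch below models a descriptor-driven engine that expands a regular 2-D tile into transfer pairs, so no shader instructions are spent on per-element address arithmetic. The TileDescriptor fields and the flat shared-memory layout are assumptions made for this sketch, not D2MA's actual interface.

```python
# Illustrative sketch of decoupling address generation from shader code:
# a descriptor describes a regular 2-D tile once, and a DMA-like engine
# expands it into (global_address, shared_memory_offset) transfer pairs.
# Field names and the byte-level layout are assumptions for illustration.

from dataclasses import dataclass

@dataclass
class TileDescriptor:
    base: int          # global base address of the tile (bytes)
    elem_bytes: int    # size of one element
    rows: int          # tile height
    cols: int          # tile width
    row_stride: int    # distance between rows in global memory (elements)

def expand_transfers(desc: TileDescriptor):
    """Generate every transfer for the tile without any per-thread address math."""
    shared_off = 0
    for r in range(desc.rows):
        for c in range(desc.cols):
            gaddr = desc.base + (r * desc.row_stride + c) * desc.elem_bytes
            yield gaddr, shared_off
            shared_off += desc.elem_bytes

# Example: a 2x4 tile of 4-byte elements from a matrix with 1024-element rows.
tile = TileDescriptor(base=0x10000, elem_bytes=4, rows=2, cols=4, row_stride=1024)
for gaddr, soff in expand_transfers(tile):
    print(hex(gaddr), soff)
```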

Citation Context

...tch data. Due to this, D2MA does not suffer from inaccuracy and only fetches data when instructed. Different warp schedulers have also been introduced to improve the utilization of GPU memory systems [18, 9, 15, 20, 30, 14]. Lakshminarayana and Kim [18] evaluated various scheduling techniques for DRAM optimization. For systems with no hardware-managed caches, they proposed a scheduler which is fair to all warps. G...

CAWS: criticality-aware warp scheduling for GPGPU workloads

by Shin-Ying Lee, Carole-Jean Wu - in Proceedings of the 23rd International Conference on Parallel Architectures and Compilation, 2014
"... The ability to perform fast context-switching and massive multi-threading is the forte of modern GPU architectures, which have emerged as an efficient alternative to traditional chip-multiprocessors for parallel workloads. One of the main benefits of such architecture is its latency-hiding capabilit ..."
Abstract - Cited by 2 (0 self)
The ability to perform fast context-switching and massive multi-threading is the forte of modern GPU architectures, which have emerged as an efficient alternative to traditional chip-multiprocessors for parallel workloads. One of the main benefits of such an architecture is its latency-hiding capability. However, the efficacy of GPUs' latency hiding varies significantly across GPGPU applications. To investigate this, this paper first proposes a new algorithm that profiles the execution behavior of GPGPU applications. We characterize latencies caused by various pipeline hazards, memory accesses, synchronization primitives, and the warp scheduler. Our results show that the current round-robin warp scheduler works well in overlapping various latency stalls with the execution of other available warps for ...
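To make the notion of issuing from a critical warp concrete, here is a minimal, hypothetical issue policy in Python. The criticality estimate (remaining instructions plus accumulated stall cycles) is an assumption for illustration; the paper explores several criticality-aware scheduling algorithms rather than this specific formula.

```python
# Minimal sketch of a criticality-aware issue policy (illustrative, not the
# paper's exact heuristic): among ready warps, issue from the one estimated to
# be the most "critical", e.g. the one with the most remaining instructions
# and accumulated stall cycles, so slow warps do not hold back the thread block.

def criticality(warp):
    # Hypothetical estimate: remaining work plus time already lost to stalls.
    return warp["insts_remaining"] + warp["stall_cycles"]

def pick_warp(warps):
    ready = [w for w in warps if w["ready"]]
    return max(ready, key=criticality) if ready else None

warps = [
    {"id": 0, "ready": True,  "insts_remaining": 120, "stall_cycles": 10},
    {"id": 1, "ready": True,  "insts_remaining": 300, "stall_cycles": 55},
    {"id": 2, "ready": False, "insts_remaining": 500, "stall_cycles": 0},
]
print(pick_warp(warps)["id"])   # -> 1, the most critical ready warp
```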

Citation Context

...he longer running warps in a thread block. Many prior works have looked at various warp scheduling algorithms to improve GPU performance, e.g., prefetch-aware scheduling [12], memory-aware scheduling [9, 10, 12, 11, 16, 18]. However, to the best of our knowledge, this work is the first to characterize warp criticality and explore different criticality-aware warp scheduling (CAWS) algorithms for modern GPU architectures....

Characterizing the Latency Hiding Ability of GPUs

by Shin-Ying Lee, Carole-Jean Wu - Proc. of the 2014 IEEE International Symposium on Performance Analysis of Systems and Software
"... The ability to perform fast context-switching and mas-sive multi-threading has been the forte of modern Graphics Processing Unit (GPU) architectures, which have emerged as an efficient alternative to traditional chip multiprocessors ..."
Abstract - Cited by 2 (0 self)
The ability to perform fast context-switching and massive multi-threading has been the forte of modern Graphics Processing Unit (GPU) architectures, which have emerged as an efficient alternative to traditional chip multiprocessors ...

Citation Context

...HRs) can cause additional penalty. lud, in particular, experiences a significant amount of delay by the unavailability of the MSHRs. This is because warps in lud, a memory-bandwidth intensive program [4, 6], often request data from the memory in a bursty manner. As a result, the performance of lud is significantly degraded by the MSHR contention. Such contention happens to srad_2 and bfs as well. Our st...

Application-aware Memory System for Fair and Efficient Execution for Concurrent GPGPU Applications

by Adwait Jog, Evgeny Bolotin, Zvika Guz, Mike Parker, Stephen W. Keckler, Mahmut T. Kandemir, Chita R. Das - in Proceedings of GPGPU-7, 2014
"... The available computing resources in modern GPUs are growing with each new generation. However, as many gen-eral purpose applications with limited thread-scalability are tuned to take advantage of GPUs, available compute re-sources might not be optimally utilized. To address this, modern GPUs will n ..."
Abstract - Cited by 2 (1 self)
The available computing resources in modern GPUs are growing with each new generation. However, as many general purpose applications with limited thread-scalability are tuned to take advantage of GPUs, available compute resources might not be optimally utilized. To address this, modern GPUs will need to execute multiple kernels simultaneously. As current generations of GPUs (e.g., NVIDIA Kepler, AMD Radeon) already enable concurrent execution of kernels from the same application, in this paper we address the next logical step: executing multiple concurrent applications in GPUs. We show that while this paradigm has a potential to improve the overall system performance, negative interactions among concurrently executing applications in the memory system can severely hamper the performance ...
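One way to picture application-aware arbitration in the memory system is a toy scheduler that serves the application furthest behind its fair bandwidth share. The sketch below is an assumption-laden illustration (queue layout, weights, and the next_request helper are invented for this example) and not the memory-system design evaluated in the paper.

```python
# Illustrative sketch of application-aware request arbitration (not the paper's
# mechanism): pick the next DRAM request from the application that has received
# the smallest fraction of its fair bandwidth share so far.

def next_request(queues, served, shares):
    """queues: app -> list of pending requests; served: app -> requests served;
    shares: app -> fair share weight. Returns (app, request) or None."""
    candidates = [a for a, q in queues.items() if q]
    if not candidates:
        return None
    # The application most behind its weighted share goes first.
    app = min(candidates, key=lambda a: served[a] / shares[a])
    return app, queues[app].pop(0)

queues = {"appA": ["r0", "r1", "r2"], "appB": ["s0"]}
served = {"appA": 10, "appB": 2}
shares = {"appA": 1.0, "appB": 1.0}
print(next_request(queues, served, shares))   # -> ('appB', 's0')
```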

Citation Context

...scheduling to improve the caching efficiency in GPUs. Gebhart and Johnson et al. [8] proposed a two-level warp scheduling technique that focuses on reducing the energy consumption in GPUs. Jog et al. [12] proposed a series of CTA-aware warp scheduling techniques to reduce cache and memory contention. Kayiran et al. [13] modulated the available thread-level parallelism by intelligent CTA scheduling. Jo...

Warp-Aware Trace Scheduling for GPUs

by James A. Jablin, Thomas B. Jablin, Onur Mutlu, Maurice Herlihy
"... GPU performance depends not only on thread/warp level parallelism (TLP) but also on instruction-level parallelism (ILP). It is not enough to schedule instructions within ba-sic blocks, it is also necessary to exploit opportunities for ILP optimization beyond branch boundaries. Unfortunately, modern ..."
Abstract - Cited by 1 (0 self)
GPU performance depends not only on thread/warp level parallelism (TLP) but also on instruction-level parallelism (ILP). It is not enough to schedule instructions within basic blocks; it is also necessary to exploit opportunities for ILP optimization beyond branch boundaries. Unfortunately, modern GPUs cannot dynamically carry out such optimizations because they lack hardware branch prediction and cannot speculatively execute instructions beyond a branch. We propose to circumvent these limitations by adapting Trace Scheduling, a technique originally developed for microcode optimization. Trace Scheduling divides code into traces (or paths), and optimizes each trace in a context-independent way. Adapting Trace Scheduling to GPU code requires revisiting and revising each step of microcode Trace Scheduling to attend to branch and warp behavior, identifying instructions on the critical path, avoiding warp divergence, and reducing divergence time. Here, we propose "Warp-Aware Trace Scheduling" for GPUs. As evaluated on the Rodinia Benchmark Suite using dynamic profiling, our fully-automatic optimization achieves a geometric mean speedup of 1.10x on a real system by increasing instructions executed per cycle (IPC) by a harmonic mean of 1.12x and reducing instruction serialization and total instructions executed.
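The core step of Trace Scheduling, growing a trace along the hottest control-flow path, can be sketched in a few lines. The grow_trace helper below is a simplified illustration built on an invented CFG/profile representation; it omits the warp-divergence weighting and the other GPU-specific adaptations the paper describes.

```python
# Illustrative sketch of trace formation from profile data (the core step of
# Trace Scheduling): starting from a seed block, repeatedly follow the most
# frequently executed successor edge to build one trace. Divergence handling
# and the GPU-specific adaptations from the paper are not modeled here.

def grow_trace(cfg, edge_counts, seed):
    """cfg: block -> list of successor blocks; edge_counts: (src, dst) -> count."""
    trace, block, visited = [seed], seed, {seed}
    while cfg.get(block):
        succ = max(cfg[block], key=lambda d: edge_counts.get((block, d), 0))
        if succ in visited:       # stop at loops so the trace stays acyclic
            break
        trace.append(succ)
        visited.add(succ)
        block = succ
    return trace

cfg = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
edge_counts = {("A", "B"): 90, ("A", "C"): 10, ("B", "D"): 90, ("C", "D"): 10}
print(grow_trace(cfg, edge_counts, "A"))   # -> ['A', 'B', 'D']
```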

Citation Context

...put. Each new generation of GPUs provides increasing levels of resources such as more registers, shared memory, functional units, and arithmetic cores. Most efforts to manage GPU resource utilization [17, 20, 25, 31, 32] have focused on thread or warp-level parallelism (TLP). Our contribution in this paper is to focus attention on the complementary use of instruction-level parallelism (ILP) to improve resource utiliz...

Adaptive and Transparent Cache Bypassing for GPUs

by Ang Li, et al., 2015
"... In the last decade, GPUs have emerged to be widely adopted for general-purpose applications. To capture on-chip locality for these applications, modern GPUs have integrated multi-level cache hierarchy, in an attempt to reduce the amount and latency of the massive and sometimes irregular mem-ory acce ..."
Abstract
In the last decade, GPUs have emerged to be widely adopted for general-purpose applications. To capture on-chip locality for these applications, modern GPUs have integrated a multi-level cache hierarchy, in an attempt to reduce the amount and latency of the massive and sometimes irregular memory accesses. However, inferior performance is frequently attained due to serious congestion in the caches resulting from the huge number of concurrent threads. In this paper, we propose a novel compile-time framework for adaptive and transparent cache bypassing on GPUs. It uses a simple yet effective approach to control the bypass degree to match the size of applications' runtime footprints. We validate the design on seven GPU platforms that cover all existing GPU generations using 16 applications from widely used GPU benchmarks. Experiments show that our design can significantly mitigate the negative impact due to small cache sizes and improve the overall performance. We analyze the performance across different platforms and applications. We also propose some optimization guidelines on how to efficiently use the GPU caches.
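A minimal sketch of the footprint-versus-capacity reasoning behind bypass-degree selection is shown below. The threshold rule, the warp-granularity split, and the helper names are assumptions for illustration only; the paper's compile-time analysis and tuning are not reproduced here.

```python
# Minimal sketch of choosing a bypass degree from an estimated footprint
# (illustrative; the paper's compile-time framework is not reproduced).
# If the concurrent footprint exceeds the L1 size, a proportional share of
# warps is marked to bypass the cache.

def bypass_fraction(footprint_bytes, l1_bytes):
    """Fraction of warps whose global loads should skip the L1 cache."""
    if footprint_bytes <= l1_bytes:
        return 0.0                          # everything fits: cache all accesses
    return 1.0 - l1_bytes / footprint_bytes # keep only what the cache can hold

def warp_bypasses(warp_id, warps_per_sm, footprint_bytes, l1_bytes):
    cutoff = int(round((1.0 - bypass_fraction(footprint_bytes, l1_bytes)) * warps_per_sm))
    return warp_id >= cutoff                # low-numbered warps keep using the cache

# Example: a 48 KB L1 with a 96 KB concurrent footprint -> half the warps bypass.
print(bypass_fraction(96 * 1024, 48 * 1024))                          # -> 0.5
print([warp_bypasses(w, 8, 96 * 1024, 48 * 1024) for w in range(8)])  # last 4 bypass
```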

A Survey of Techniques for Managing and . . .

by Sparsh Mittal - Journal of Circuits, Systems, and Computers, 2014
"... Initially introduced as special-purpose accelerators for graphics applications, GPUs have now emerged as general purpose computing platforms for a wide range of applications. To address the requirements of these applications, modern GPUs include sizable hardware-managed caches. However, several fact ..."
Abstract
Initially introduced as special-purpose accelerators for graphics applications, GPUs have now emerged as general purpose computing platforms for a wide range of applications. To address the requirements of these applications, modern GPUs include sizable hardware-managed caches. However, several factors, such as the unique architecture of GPUs and the rise of CPU-GPU heterogeneous computing, demand effective management of caches to achieve high performance and energy efficiency. Recently, several techniques have been proposed for this purpose. In this paper, we survey several architectural and system-level techniques proposed for managing and leveraging GPU caches. We also discuss the importance and challenges of cache management in GPUs. The aim of this paper is to provide readers insights into cache management techniques for GPUs and motivate them to propose even better techniques for leveraging the full potential of caches in the GPUs of tomorrow.

Improving the Programmability of GPU Architectures

by Cedric Nugteren, 2014
"... ..."
Abstract - Add to MetaCart
Abstract not found

Mascar: Speeding up GPU Warps by Reducing Memory Pitstops

by Ankit Sethia, D. Anoushe Jamshidi, Scott Mahlke
"... Abstract-With the prevalence of GPUs as throughput engines for data parallel workloads, the landscape of GPU computing is changing significantly. Non-graphics workloads with high memory intensity and irregular access patterns are frequently targeted for acceleration on GPUs. While GPUs provide larg ..."
Abstract
With the prevalence of GPUs as throughput engines for data parallel workloads, the landscape of GPU computing is changing significantly. Non-graphics workloads with high memory intensity and irregular access patterns are frequently targeted for acceleration on GPUs. While GPUs provide large numbers of compute resources, the resources needed for memory intensive workloads are more scarce. Therefore, managing access to these limited memory resources is a challenge for GPUs. We propose a novel Memory Aware Scheduling and Cache Access Re-execution (Mascar) system on GPUs tailored for better performance for memory intensive workloads. This scheme detects memory saturation and prioritizes memory requests among warps to enable better overlapping of compute and memory accesses. Furthermore, it enables limited re-execution of memory instructions to eliminate structural hazards in the memory subsystem and take advantage of cache locality in cases where requests cannot be sent to the memory due to memory saturation. Our results show that Mascar provides a 34% speedup over the baseline round-robin scheduler and a 10% speedup over state-of-the-art warp schedulers for memory intensive workloads. Mascar also achieves an average of 12% savings in energy for such workloads.
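The two ideas in Mascar, prioritizing one warp's memory requests under saturation and re-executing accesses rejected by a full memory subsystem, can be caricatured with a toy model. Everything below (the class name, the MSHR-based saturation test, the owner-warp hand-off) is an assumption for illustration, not the hardware design in the paper.

```python
# Illustrative toy model of two Mascar-style ideas (names and the saturation
# test are assumptions): when the memory subsystem is saturated, only one
# "owner" warp may issue new memory instructions so its requests complete
# sooner; loads rejected because no MSHR is free go to a re-execution queue
# and are retried later instead of stalling the pipeline.

class ToyMemScheduler:
    def __init__(self, mshr_limit):
        self.mshr_limit = mshr_limit
        self.mshrs_in_use = 0
        self.owner = None            # warp allowed to issue during saturation
        self.reexec_queue = []       # memory accesses to retry later

    def saturated(self):
        return self.mshrs_in_use >= self.mshr_limit

    def may_issue_mem(self, warp_id):
        if not self.saturated():
            return True
        if self.owner is None:
            self.owner = warp_id     # first blocked warp becomes the owner
        return warp_id == self.owner

    def issue_load(self, warp_id, addr):
        if self.mshrs_in_use < self.mshr_limit:
            self.mshrs_in_use += 1
            return "sent"
        self.reexec_queue.append((warp_id, addr))
        return "queued for re-execution"

sched = ToyMemScheduler(mshr_limit=2)
print(sched.issue_load(0, 0x100), sched.issue_load(1, 0x200), sched.issue_load(2, 0x300))
print(sched.may_issue_mem(3), sched.may_issue_mem(4))   # -> True False (warp 3 owns issue)
```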

Citation Context

...GPU workloads tend to have very regular, streaming memory access patterns, recent research has examined GPU applications that benefit from cache locality [27], [19] and have more non-streaming accesses. If this data locality is not exploited, cache thrashing will occur, causing performance degradation. In this work, we establish that the warp scheduler present in a GPU’s Streaming Multiprocessor (SM) plays a pivotal role in achieving high performance for memory intensive workloads, specifically by prioritizing memory requests from one warp over those of others. While recent work by Jog et al. [15] has shown that scheduling to improve cache and memory locality leads to better performance, we stipulate that the role of scheduling is not limited to workloads which have such locality. We show that scheduling is also critical in improving the performance of many memory intensive workloads that do not exhibit data locality. We propose Memory Aware Scheduling and Cache Access Re-execution (Mascar) to better overlap computation and memory accesses for memory intensive workloads. The intuition behind Mascar is that when the memory subsystem is saturated, all the memory requests of one warp shou...
