Results 1 - 10 of 18
A Locality-Aware Memory Hierarchy for Energy-Efficient GPU Architectures
"... As GPU’s compute capabilities grow, their memory hierarchy increasingly becomes a bottleneck. Current GPU memory hierarchies use coarse-grained memory accesses to exploit spatial locality, maximize peak bandwidth, simplify control, and reduce cache meta-data storage. These coarse-grained memory acce ..."
Abstract
-
Cited by 6 (1 self)
- Add to MetaCart
(Show Context)
As GPUs' compute capabilities grow, their memory hierarchy increasingly becomes a bottleneck. Current GPU memory hierarchies use coarse-grained memory accesses to exploit spatial locality, maximize peak bandwidth, simplify control, and reduce cache metadata storage. These coarse-grained memory accesses, however, are a poor match for emerging GPU applications with irregular control flow and memory access patterns. Meanwhile, the massive multi-threading of GPUs and the simplicity of their cache hierarchies make CPU-specific memory system enhancements ineffective for improving the performance of irregular GPU applications. We design and evaluate a locality-aware memory hierarchy for throughput processors, such as GPUs. Our proposed design retains the advantages of coarse-grained accesses for spatially and temporally local programs while permitting selective fine-grained access to memory. By adaptively adjusting the access granularity, memory bandwidth and energy are reduced for data with low spatial/temporal locality without wasting control overheads or prefetching potential for data with high spatial locality. As such, our locality-aware memory hierarchy improves GPU performance, energy efficiency, and memory throughput for a large range of applications.
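The adaptive-granularity idea lends itself to a small illustration. The C++ sketch below models a per-region predictor that tracks how many 32B sectors of each fetched 128B line were actually touched, and switches a region to fine-grained fetches when utilization drops; the table structure, sector sizes, and 50% threshold are illustrative assumptions, not the paper's exact mechanism.

```cpp
#include <bit>
#include <cstdint>
#include <unordered_map>

// Illustrative constants: a 128B cache line split into four 32B sectors.
constexpr uint32_t SECTORS_PER_LINE = 4;

// Hypothetical per-region predictor: on each line eviction it records
// which sectors were actually touched; regions with low sector
// utilization are switched to fine-grained (single-sector) fetches.
struct GranularityPredictor {
    struct Entry { uint32_t used = 0, fetched = 0; };
    std::unordered_map<uint64_t, Entry> table;  // keyed by region id

    // Called when a line is evicted, with a bitmask of touched sectors.
    void record(uint64_t region, uint8_t touched_mask) {
        Entry& e = table[region];
        e.fetched += SECTORS_PER_LINE;
        e.used    += std::popcount(static_cast<unsigned>(touched_mask));
    }

    // Fetch fine-grained when observed sector utilization for this
    // region falls below an assumed 50% threshold; default to coarse.
    bool use_fine_grained(uint64_t region) const {
        auto it = table.find(region);
        if (it == table.end() || it->second.fetched == 0) return false;
        return it->second.used * 2 < it->second.fetched;
    }
};
```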
D2MA: Accelerating Coarse-Grained Data Transfer for GPUs
"... To achieve high performance on many-core architectures like GPUs, it is crucial to efficiently utilize the available mem-ory bandwidth. Currently, it is common to use fast, on-chip scratchpad memories, like the shared memory available on GPUs ’ shader cores, to buffer data for computation. This buff ..."
Abstract
-
Cited by 3 (0 self)
- Add to MetaCart
(Show Context)
To achieve high performance on many-core architectures like GPUs, it is crucial to efficiently utilize the available memory bandwidth. Currently, it is common to use fast, on-chip scratchpad memories, like the shared memory available on GPUs' shader cores, to buffer data for computation. This buffering, however, has several sources of inefficiency that hinder it from making the most efficient use of the available memory resources. These issues stem from shader resources being used for repeated, regular address calculations, the need to shuffle data multiple times within a physically unified on-chip memory, and forcing all threads to synchronize for RAW consistency at the speed of the slowest threads. To address these inefficiencies, we propose Data-Parallel DMA, or D2MA. D2MA is a reimagination of traditional DMA that addresses the challenges of extending DMA to thousands of concurrently executing threads. D2MA decouples address generation from the shader's computational resources, provides a more direct and efficient path for data in global memory to travel into the shared memory, and introduces a novel dynamic synchronization scheme that is transparent to the programmer. These advancements allow D2MA to achieve speedups as high as 2.29x and reduce the time to buffer data by 81% on average.
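As a rough illustration of D2MA's two key moves, decoupled address generation and barrier-free synchronization, the hedged C++ sketch below programs a strided tile copy through a single descriptor and publishes per-row ready flags that consumers wait on individually; the descriptor fields and flag scheme are assumptions for exposition, not the hardware design.

```cpp
#include <atomic>
#include <cstddef>
#include <vector>

// Hypothetical D2MA-style descriptor: one programming step replaces
// per-thread address arithmetic for a strided tile load.
struct DmaDescriptor {
    const float* global_base;            // source tile in global memory
    float*       shared_base;            // destination staging buffer
    std::size_t  rows, row_len, stride;  // tile shape and source stride
};

// Per-row ready flags stand in for the paper's dynamic synchronization:
// consumers wait only on the rows they need instead of a full barrier.
struct DmaEngine {
    std::vector<std::atomic<bool>> ready;

    explicit DmaEngine(std::size_t rows) : ready(rows) {}

    // Engine side: copy row by row, publishing completion per row.
    void run(const DmaDescriptor& d) {
        for (std::size_t r = 0; r < d.rows; ++r) {
            for (std::size_t c = 0; c < d.row_len; ++c)
                d.shared_base[r * d.row_len + c] = d.global_base[r * d.stride + c];
            ready[r].store(true, std::memory_order_release);
        }
    }

    // Consumer side: block (spin) only until its row has arrived.
    void wait_row(std::size_t r) const {
        while (!ready[r].load(std::memory_order_acquire)) { /* spin */ }
    }
};
```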
CAWS: criticality-aware warp scheduling for GPGPU workloads
- in Proceedings of the 23rd International Conference on Parallel Architectures and Compilation
, 2014
"... The ability to perform fast context-switching and massive multi-threading is the forte of modern GPU architectures, which have emerged as an efficient alternative to traditional chip-multiprocessors for parallel workloads. One of the main benefits of such architecture is its latency-hiding capabilit ..."
Abstract
-
Cited by 2 (0 self)
- Add to MetaCart
(Show Context)
The ability to perform fast context-switching and massive multi-threading is the forte of modern GPU architectures, which have emerged as an efficient alternative to traditional chip-multiprocessors for parallel workloads. One of the main benefits of such an architecture is its latency-hiding capability. However, the efficacy of the GPU's latency hiding varies significantly across GPGPU applications. To investigate this, this paper first proposes a new algorithm that profiles the execution behavior of GPGPU applications. We characterize latencies caused by various pipeline hazards, memory accesses, synchronization primitives, and the warp scheduler. Our results show that the current round-robin warp scheduler works well in overlapping various latency stalls with the execution of other available warps for ...
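To make the criticality idea concrete, here is a minimal C++ sketch of a criticality-aware issue policy: among ready warps, issue the one predicted to finish last, so stragglers do not hold back the whole thread block. The criticality metric (remaining instructions plus accumulated stalls) is an assumed stand-in, not CAWS's actual estimator.

```cpp
#include <vector>

// Hypothetical per-warp state; "criticality" here is a stand-in metric
// combining profiled remaining work with observed stall time.
struct Warp {
    int  id;
    bool ready;            // no unresolved hazard this cycle
    long remaining_insts;  // profiled estimate of work left
    long stall_cycles;     // accumulated memory/pipeline stalls
    long criticality() const { return remaining_insts + stall_cycles; }
};

// Criticality-aware issue: among ready warps, pick the one predicted
// to finish last instead of rotating round-robin.
int pick_warp(const std::vector<Warp>& warps) {
    int best = -1;
    for (const auto& w : warps)
        if (w.ready && (best < 0 || w.criticality() > warps[best].criticality()))
            best = w.id;   // assumes warps[i].id == i for brevity
    return best;           // -1 means nothing issuable this cycle
}
```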
Characterizing the Latency Hiding Ability of GPUs
- Proc. of the 2014 IEEE International Symposium on Performance Analysis of Systems and Software
, 2014
"... The ability to perform fast context-switching and mas-sive multi-threading has been the forte of modern Graphics Processing Unit (GPU) architectures, which have emerged as an efficient alternative to traditional chip multiprocessors ..."
Abstract
-
Cited by 2 (0 self)
- Add to MetaCart
(Show Context)
The ability to perform fast context-switching and massive multi-threading has been the forte of modern Graphics Processing Unit (GPU) architectures, which have emerged as an efficient alternative to traditional chip multiprocessors ...
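A back-of-envelope calculation shows why this capability matters. The snippet below applies the standard Little's-law-style estimate, warps needed ≈ stall latency / issue interval, with illustrative numbers; this is textbook reasoning, not the paper's characterization methodology.

```cpp
#include <cstdio>

// To keep an issue pipeline busy across a stall of `latency_cycles`,
// a scheduler needs roughly latency / issue-interval other warps to
// switch to while the stalled warp waits.
int warps_to_hide(int latency_cycles, int cycles_between_issues) {
    return (latency_cycles + cycles_between_issues - 1) / cycles_between_issues;
}

int main() {
    // Illustrative numbers: a 400-cycle memory latency with a warp
    // issuing every 4 cycles needs about 100 warps in flight.
    std::printf("%d warps\n", warps_to_hide(400, 4));
}
```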
Application-aware Memory System for Fair and Efficient Execution of Concurrent GPGPU Applications
- in Proceedings of GPGPU-7
, 2014
"... The available computing resources in modern GPUs are growing with each new generation. However, as many gen-eral purpose applications with limited thread-scalability are tuned to take advantage of GPUs, available compute re-sources might not be optimally utilized. To address this, modern GPUs will n ..."
Abstract
-
Cited by 2 (1 self)
- Add to MetaCart
(Show Context)
The available computing resources in modern GPUs are growing with each new generation. However, as many general-purpose applications with limited thread-scalability are tuned to take advantage of GPUs, the available compute resources might not be optimally utilized. To address this, modern GPUs will need to execute multiple kernels simultaneously. As current generations of GPUs (e.g., NVIDIA Kepler, AMD Radeon) already enable concurrent execution of kernels from the same application, in this paper we address the next logical step: executing multiple concurrent applications on GPUs. We show that while this paradigm has the potential to improve overall system performance, negative interactions among concurrently executing applications in the memory system can severely hamper the performance ...
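One common way to mitigate such interference is least-attained-service arbitration; the hedged C++ sketch below serves requests from whichever application has received the least memory service so far. This illustrates the general fairness idea only and is not necessarily the policy the paper proposes.

```cpp
#include <deque>
#include <vector>

// Hypothetical fairness-aware arbiter for memory requests from
// concurrently executing applications.
struct Request { int app_id; /* address, size, ... */ };

struct FairArbiter {
    std::vector<long>                service;  // bytes served per app
    std::vector<std::deque<Request>> queues;   // pending requests per app

    explicit FairArbiter(int apps) : service(apps, 0), queues(apps) {}

    // Serve the non-empty queue whose application has received the
    // least service so far, so one memory-hungry application cannot
    // starve the others.
    bool issue(Request& out) {
        int pick = -1;
        for (int a = 0; a < (int)queues.size(); ++a)
            if (!queues[a].empty() && (pick < 0 || service[a] < service[pick]))
                pick = a;
        if (pick < 0) return false;
        out = queues[pick].front();
        queues[pick].pop_front();
        service[pick] += 128;  // assume 128B served per request
        return true;
    }
};
```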
Warp-Aware Trace Scheduling for GPUs
"... GPU performance depends not only on thread/warp level parallelism (TLP) but also on instruction-level parallelism (ILP). It is not enough to schedule instructions within ba-sic blocks, it is also necessary to exploit opportunities for ILP optimization beyond branch boundaries. Unfortunately, modern ..."
Abstract
-
Cited by 1 (0 self)
- Add to MetaCart
(Show Context)
GPU performance depends not only on thread/warp-level parallelism (TLP) but also on instruction-level parallelism (ILP). It is not enough to schedule instructions within basic blocks; it is also necessary to exploit opportunities for ILP optimization beyond branch boundaries. Unfortunately, modern GPUs cannot dynamically carry out such optimizations because they lack hardware branch prediction and cannot speculatively execute instructions beyond a branch. We propose to circumvent these limitations by adapting Trace Scheduling, a technique originally developed for microcode optimization. Trace Scheduling divides code into traces (or paths), and optimizes each trace in a context-independent way. Adapting Trace Scheduling to GPU code requires revisiting and revising each step of microcode Trace Scheduling to attend to branch and warp behavior: identifying instructions on the critical path, avoiding warp divergence, and reducing divergence time. Here, we propose "Warp-Aware Trace Scheduling" for GPUs. As evaluated on the Rodinia Benchmark Suite using dynamic profiling, our fully automatic optimization achieves a geometric-mean speedup of 1.10x on a real system by increasing instructions executed per cycle (IPC) by a harmonic mean of 1.12x and reducing instruction serialization and total instructions executed.
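The core trace-formation step can be sketched compactly. The C++ fragment below grows a trace from a seed block by repeatedly following the successor edge with the highest profiled execution count, producing a straight-line region for scheduling; the paper's divergence-aware refinements are reduced here to a simple frequency comparison.

```cpp
#include <vector>

// Minimal trace formation in the spirit of Trace Scheduling: each
// basic block carries profiled execution counts for its CFG edges.
struct Block {
    std::vector<int>  succ;        // successor block ids
    std::vector<long> succ_count;  // profiled taken-count per edge
};

// Starting from a seed block, follow the hottest successor edge until
// a block repeats or the trace reaches a CFG exit.
std::vector<int> form_trace(const std::vector<Block>& cfg, int seed) {
    std::vector<int>  trace;
    std::vector<bool> visited(cfg.size(), false);
    int b = seed;
    while (b >= 0 && !visited[b]) {
        visited[b] = true;
        trace.push_back(b);
        const Block& blk = cfg[b];
        int best = -1;
        for (int i = 0; i < (int)blk.succ.size(); ++i)  // pick hottest edge
            if (best < 0 || blk.succ_count[i] > blk.succ_count[best])
                best = i;
        b = (best < 0) ? -1 : blk.succ[best];           // stop at exit
    }
    return trace;
}
```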
Adaptive and Transparent Cache Bypassing for GPUs
, 2015
"... In the last decade, GPUs have emerged to be widely adopted for general-purpose applications. To capture on-chip locality for these applications, modern GPUs have integrated multi-level cache hierarchy, in an attempt to reduce the amount and latency of the massive and sometimes irregular mem-ory acce ..."
Abstract
- Add to MetaCart
In the last decade, GPUs have been widely adopted for general-purpose applications. To capture on-chip locality for these applications, modern GPUs integrate a multi-level cache hierarchy in an attempt to reduce the volume and latency of their massive, and sometimes irregular, memory accesses. However, inferior performance is frequently attained due to serious congestion in the caches resulting from the huge number of concurrent threads. In this paper, we propose a novel compile-time framework for adaptive and transparent cache bypassing on GPUs. It uses a simple yet effective approach to control the bypass degree to match the size of applications' runtime footprints. We validate the design on seven GPU platforms that cover all existing GPU generations, using 16 applications from widely used GPU benchmarks. Experiments show that our design can significantly mitigate the negative impact of small cache sizes and improve overall performance. We analyze the performance across different platforms and applications. We also propose some optimization guidelines on how to use GPU caches efficiently.
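A minimal sketch of the bypass-degree idea, under assumed thresholds: if the kernel's estimated footprint exceeds cache capacity, only a matching fraction of threads keep using cached loads and the rest bypass. The footprint estimate and per-thread test below are illustrative, not the framework's actual analysis.

```cpp
// Sketch: match the cached working set to what the L1 can hold, and
// compile the remaining threads' loads to a bypassing variant so the
// cache is not thrashed by the full footprint.
struct BypassConfig {
    long l1_bytes;         // per-SM cache capacity
    long footprint_bytes;  // estimated resident data of the kernel

    double cache_fraction() const {
        if (footprint_bytes <= l1_bytes) return 1.0;  // everything fits
        return (double)l1_bytes / (double)footprint_bytes;
    }
};

// Per-thread decision: threads whose id falls within the caching
// fraction use normal cached loads; on real hardware the others would
// be compiled to a non-caching load variant.
bool thread_uses_cache(const BypassConfig& c, int tid, int total_threads) {
    return tid < (int)(c.cache_fraction() * total_threads);
}
```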
A Survey of Techniques for Managing and Leveraging Caches in GPUs
- JOURNAL OF CIRCUITS, SYSTEMS, AND COMPUTERS
, 2014
"... Initially introduced as special-purpose accelerators for graphics applications, GPUs have now emerged as general purpose computing platforms for a wide range of applications. To address the requirements of these applications, modern GPUs include sizable hardware-managed caches. However, several fact ..."
Abstract
- Add to MetaCart
Initially introduced as special-purpose accelerators for graphics applications, GPUs have now emerged as general-purpose computing platforms for a wide range of applications. To address the requirements of these applications, modern GPUs include sizable hardware-managed caches. However, several factors, such as the unique architecture of GPUs and the rise of CPU-GPU heterogeneous computing, demand effective management of caches to achieve high performance and energy efficiency. Recently, several techniques have been proposed for this purpose. In this paper, we survey architectural and system-level techniques proposed for managing and leveraging GPU caches. We also discuss the importance and challenges of cache management in GPUs. The aim of this paper is to provide readers with insights into cache management techniques for GPUs and to motivate them to propose even better techniques for leveraging the full potential of caches in the GPUs of tomorrow.
Mascar: Speeding up GPU Warps by Reducing Memory Pitstops
"... Abstract-With the prevalence of GPUs as throughput engines for data parallel workloads, the landscape of GPU computing is changing significantly. Non-graphics workloads with high memory intensity and irregular access patterns are frequently targeted for acceleration on GPUs. While GPUs provide larg ..."
Abstract
- Add to MetaCart
(Show Context)
With the prevalence of GPUs as throughput engines for data-parallel workloads, the landscape of GPU computing is changing significantly. Non-graphics workloads with high memory intensity and irregular access patterns are frequently targeted for acceleration on GPUs. While GPUs provide large numbers of compute resources, the resources needed by memory-intensive workloads are scarcer. Therefore, managing access to these limited memory resources is a challenge for GPUs. We propose a novel Memory Aware Scheduling and Cache Access Re-execution (Mascar) system for GPUs, tailored for better performance on memory-intensive workloads. This scheme detects memory saturation and prioritizes memory requests among warps to enable better overlapping of compute and memory accesses. Furthermore, it enables limited re-execution of memory instructions to eliminate structural hazards in the memory subsystem and to take advantage of cache locality in cases where requests cannot be sent to memory due to saturation. Our results show that Mascar provides a 34% speedup over the baseline round-robin scheduler and a 10% speedup over state-of-the-art warp schedulers for memory-intensive workloads. Mascar also achieves an average of 12% energy savings for such workloads.
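The saturation-driven mode switch can be sketched as follows: when outstanding requests approach the memory subsystem's capacity, the scheduler stops rotating among memory warps and lets a single owner warp issue all of its requests first. The queue capacity, 90% threshold, and bookkeeping below are assumptions; request completion and the re-execution path are omitted.

```cpp
#include <deque>

// Saturation detector: flags when outstanding requests fill an assumed
// 90% of the tracking capacity (e.g., MSHR entries).
struct MemSubsystem {
    int outstanding = 0;
    int capacity    = 64;
    bool saturated() const { return outstanding * 10 >= capacity * 9; }
};

// Scheduler sketch: round-robin among memory warps normally; under
// saturation, a single "owner" warp issues exclusively so its data
// returns sooner and compute can resume.
struct Scheduler {
    std::deque<int> mem_warps;  // warps waiting to issue memory ops
    int owner = -1;             // warp holding priority under saturation

    int next_mem_warp(const MemSubsystem& mem) {
        if (mem_warps.empty()) return -1;
        if (mem.saturated()) {
            if (owner < 0) owner = mem_warps.front();
            return owner;                       // owner issues exclusively
        }
        owner = -1;                             // normal mode: rotate
        int w = mem_warps.front();
        mem_warps.pop_front();
        mem_warps.push_back(w);
        return w;
    }
};
```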