Results 1 - 10 of 18
Exploiting Inter-Warp Heterogeneity to Improve GPGPU Performance
"... Abstract—In a GPU, all threads within a warp execute the same instruction in lockstep. For a memory instruc-tion, this can lead to memory divergence: the memory requests for some threads are serviced early, while the remaining requests incur long latencies. This divergence stalls the warp, as it can ..."
that prioritizes the few requests received from high cache utility warps to minimize stall time. We compare MeDiC to four cache management techniques, and find that it delivers an average speedup of 21.8%, and 20.1% higher energy efficiency, over a state-of-the-art GPU cache management mechanism across 15
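MeDiC itself is a hardware cache-management policy, so there is no source code to reproduce; the short CUDA sketch below (all names are placeholders) merely illustrates the memory-divergence behaviour the abstract describes, where a warp's loads land at different levels of the memory hierarchy and the warp waits for the slowest lane.

    // Illustrative only, not part of MeDiC: a data-dependent gather whose
    // per-lane loads can be serviced at very different latencies, stalling
    // the whole warp (the "memory divergence" described above).
    __global__ void gather(const int* indices, const float* table,
                           float* out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        // indices[i] is read coalesced; table[indices[i]] may scatter across
        // many cache lines, so some lanes hit in cache while others go to DRAM.
        out[i] = table[indices[i]];
    }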
Implementation of the DWT in a GPU through a Register-based Strategy
"... Abstract—The release of the CUDA Kepler architecture in March 2012 has provided Nvidia GPUs with a larger regis-ter memory space and instructions for the communication of registers among threads. This facilitates a new programming strategy that utilizes registers for data sharing and reusing in detr ..."
Cited by 1 (1 self)
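As a rough illustration of the register-communication primitive this strategy builds on (not the paper's DWT kernels), the sketch below uses the warp-shuffle intrinsic introduced with Kepler, written with the modern __shfl_up_sync spelling; the lifting-style update, kernel name, and problem size are assumptions.

    // Minimal sketch of register-to-register data exchange within a warp.
    // Assumes n is a multiple of the warp size so the full sync mask is valid.
    #include <cstdio>

    __global__ void warp_shuffle_demo(const float* in, float* out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        float x = in[i];                            // each thread holds one sample in a register
        // Read the left neighbour's register directly, with no shared memory involved.
        // Lanes with no left neighbour (lane 0) simply get their own value back.
        float left = __shfl_up_sync(0xffffffffu, x, 1);
        out[i] = x - 0.5f * left;                   // placeholder lifting-style update
    }

    int main() {
        const int n = 64;
        float h_in[n], h_out[n];
        for (int i = 0; i < n; ++i) h_in[i] = (float)i;
        float *d_in, *d_out;
        cudaMalloc(&d_in, n * sizeof(float));
        cudaMalloc(&d_out, n * sizeof(float));
        cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);
        warp_shuffle_demo<<<1, n>>>(d_in, d_out, n);
        cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
        printf("out[1] = %f\n", h_out[1]);
        cudaFree(d_in); cudaFree(d_out);
        return 0;
    }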
A GPU-based Algorithm-specific Optimization for High-performance Background Subtraction
"... Abstract-Background subtraction is an essential first stage in many vision applications differentiating foreground pixels from the background scene, with Mixture of Gaussians (MoG) being a widely used implementation choice. MoG's high computation demand renders a real-time single threaded real ..."
bandwidth demand. In this paper, we propose a GPU implementation of Mixture of Gaussians (MoG) that surpasses real-time processing for full HD (1080p 60 Hz). This paper describes step-wise optimizations starting from general GPU optimizations (such as memory coalescing, computation & communication
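As a toy counterpart to the optimizations listed (memory coalescing in particular), the sketch below updates a single Gaussian per pixel instead of a full mixture; the structure-of-arrays parameter layout, threshold, and learning rate alpha are assumptions, not the paper's implementation.

    // Simplified per-pixel background model (single Gaussian, not a full MoG).
    // Parameters are stored as separate arrays (SoA) so that consecutive threads
    // read consecutive addresses and the loads stay coalesced across a warp.
    __global__ void bg_update(const unsigned char* frame,   // grayscale frame, n_pixels values
                              float* mean, float* var,      // per-pixel model parameters (SoA)
                              unsigned char* fg_mask,       // output foreground mask
                              int n_pixels, float alpha) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n_pixels) return;

        float x = (float)frame[i];
        float m = mean[i];
        float v = var[i];

        float d = x - m;
        bool foreground = d * d > 6.25f * v;    // 2.5-sigma test (placeholder threshold)
        fg_mask[i] = foreground ? 255 : 0;

        // Exponential update of the model (simplified: always update).
        mean[i] = m + alpha * d;
        var[i]  = fmaxf(4.0f, v + alpha * (d * d - v));
    }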
Speeding up K-Means Algorithm by GPUs
In Proceedings of the 10th IEEE International Conference on Computer and Information Technology (CIT 2010), IEEE Computer Society
"... Abstract—Cluster analysis plays a critical role in a wide variety of applications, but it is now facing the computational challenge due to the continuously increasing data volume. Parallel computing is one of the most promising solutions to overcoming the computational challenge. In this paper, we t ..."
Cited by 15 (2 self)
latency. For high-dimensional data sets, we design a novel algorithm which simulates matrix multiplication and exploits GPU on-chip registers and also on-chip shared memory to achieve high compute-to-memory-access ratio. As a result, our GPU-based k-Means algorithm is three to eight times faster than
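A minimal sketch of the label-assignment step organized the way the snippet describes, with a tile of centroids staged in shared memory and per-thread accumulation in registers; K, DIM, the row-major layouts, and all names are assumptions rather than the authors' code.

    // Assign each point to its nearest centroid. The centroid tile is shared by
    // the block; the point and the running distance stay in registers.
    #define K   16      // number of centroids (assumed to fit in one shared tile)
    #define DIM 8       // dimensionality (assumed)

    __global__ void assign_labels(const float* points,     // n x DIM, row-major
                                  const float* centroids,  // K x DIM, row-major
                                  int* labels, int n) {
        __shared__ float c[K][DIM];
        for (int t = threadIdx.x; t < K * DIM; t += blockDim.x)
            c[t / DIM][t % DIM] = centroids[t];
        __syncthreads();

        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;

        float p[DIM];                                       // the point stays in registers
        for (int d = 0; d < DIM; ++d) p[d] = points[i * DIM + d];

        float best = 1e30f; int bestk = 0;
        for (int k = 0; k < K; ++k) {
            float dist = 0.f;
            for (int d = 0; d < DIM; ++d) {                 // inner-product-like accumulation
                float diff = p[d] - c[k][d];
                dist += diff * diff;
            }
            if (dist < best) { best = dist; bestk = k; }
        }
        labels[i] = bestk;
    }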
A Hybrid Circular Queue Method for Iterative Stencil Computations on GPUs
"... Abstract In this paper, we present a hybrid circular queue method that can significantly boost the performance of stencil computations on GPU by carefully balancing usage of registers and shared-memory. Unlike earlier methods that rely on circular queues predominantly implemented using indirectly ad ..."
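The sketch below shows the basic register-rotation idea for a 7-point stencil under simplifying assumptions (nx and ny multiples of the block size, nz >= 3, boundary planes left untouched, placeholder coefficients); it is a plain illustration of a queue split between registers and shared memory, not the paper's hybrid circular-queue method.

    #define BX 16
    #define BY 16

    __global__ void stencil7(const float* in, float* out, int nx, int ny, int nz) {
        int ix = blockIdx.x * BX + threadIdx.x;
        int iy = blockIdx.y * BY + threadIdx.y;
        bool interior = (ix > 0 && ix < nx - 1 && iy > 0 && iy < ny - 1);

        __shared__ float plane[BY][BX];             // tile of the current z-plane

        size_t stride = (size_t)nx * ny;
        size_t idx = (size_t)iy * nx + ix;

        float below = in[idx];                      // plane z-1, held in a register
        float cur   = in[idx + stride];             // plane z,   held in a register
        for (int z = 1; z < nz - 1; ++z) {
            float above = in[idx + (size_t)(z + 1) * stride];   // plane z+1
            plane[threadIdx.y][threadIdx.x] = cur;
            __syncthreads();

            if (interior) {
                // x/y neighbours come from shared memory inside the tile,
                // from global memory at tile edges.
                size_t zi = idx + (size_t)z * stride;
                float xm = (threadIdx.x > 0)      ? plane[threadIdx.y][threadIdx.x - 1] : in[zi - 1];
                float xp = (threadIdx.x < BX - 1) ? plane[threadIdx.y][threadIdx.x + 1] : in[zi + 1];
                float ym = (threadIdx.y > 0)      ? plane[threadIdx.y - 1][threadIdx.x] : in[zi - nx];
                float yp = (threadIdx.y < BY - 1) ? plane[threadIdx.y + 1][threadIdx.x] : in[zi + nx];
                out[zi] = (xm + xp + ym + yp + below + above) / 6.0f;  // placeholder weights
            }
            __syncthreads();

            below = cur;                            // rotate the register queue: z-1 <- z, z <- z+1
            cur   = above;
        }
    }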
Efficient Warp Execution in Presence of Divergence with Collaborative Context Collection
"... GPU’s SIMD architecture is a double-edged sword con-fronting parallel tasks with control flow divergence. On the one hand, it provides a high performance yet power-efficient platform to accelerate applications via massive parallelism; however, on the other hand, irregularities induce inefficiencies ..."
-signment or by intra-warp load imbalance. CCC collects the relevant registers of divergent threads in a warp-specific stack allocated in the fast shared memory, and restores them only when the perfect utilization of warp lanes becomes feasible. We propose code transformations to enable applicability of CCC to va
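A much-simplified sketch of the collection idea follows (one placeholder integer of "context" per lane instead of the full register set, warp primitives in their modern _sync spellings, and a block size assumed to be a multiple of 32); the stack sizes, names, and the final drain of leftover entries are assumptions, not the paper's code transformation.

    #define WARPS_PER_BLOCK 4
    #define STACK_CAP 64

    __device__ void heavy_work(int task) { (void)task; /* placeholder divergent-region body */ }

    __global__ void ccc_sketch(const int* tasks, const int* predicate, int n) {
        __shared__ int stack[WARPS_PER_BLOCK][STACK_CAP];   // warp-specific context stacks
        __shared__ int top[WARPS_PER_BLOCK];

        int warp = threadIdx.x / 32;
        int lane = threadIdx.x % 32;
        if (lane == 0) top[warp] = 0;
        __syncwarp();

        // The loop bound uses the warp's first index so whole warps stay converged.
        int base = blockIdx.x * blockDim.x + threadIdx.x - lane;
        for (; base < n; base += blockDim.x * gridDim.x) {
            int i = base + lane;
            bool active = (i < n) && (predicate[i] != 0);       // the divergent condition
            unsigned mask = __ballot_sync(0xffffffffu, active);
            int offset = __popc(mask & ((1u << lane) - 1u));    // rank among active lanes
            if (active)
                stack[warp][top[warp] + offset] = tasks[i];     // push this lane's context
            __syncwarp();
            if (lane == 0) top[warp] += __popc(mask);
            __syncwarp();

            while (top[warp] >= 32) {                           // a full warp's worth is available
                int task = stack[warp][top[warp] - 32 + lane];  // every lane pops one entry
                heavy_work(task);                               // runs with no idle lanes
                __syncwarp();
                if (lane == 0) top[warp] -= 32;
                __syncwarp();
            }
        }
        // A real implementation would also drain the (< 32) leftover contexts here.
    }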
Fat versus Thin Threading Approach on GPUs: Application to Stochastic Simulation of Chemical Reactions
"... Abstract—We explore two different threading approaches on a graphics processing unit (GPU) exploiting two different characteristics of the current GPU architecture. The fat thread approach tries to minimize data access time by relying on shared memory and registers potentially sacrificing parallelis ..."
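A rough, generic contrast of the two styles (a placeholder state-update rule stands in for the paper's stochastic chemical system; NSPECIES, the kernel names, and the coefficient are assumptions): the "fat" kernel gives each simulation to one thread and keeps the whole state vector in registers, while the "thin" kernel spawns one thread per simulation/species pair and exchanges state through global memory between launches.

    #define NSPECIES 8

    __global__ void fat_kernel(float* state, int n_sims, int n_steps) {
        int s = blockIdx.x * blockDim.x + threadIdx.x;
        if (s >= n_sims) return;
        float x[NSPECIES];                          // full per-simulation state in registers
        for (int k = 0; k < NSPECIES; ++k) x[k] = state[s * NSPECIES + k];
        for (int step = 0; step < n_steps; ++step)
            for (int k = 0; k < NSPECIES; ++k)
                x[k] += 0.01f * (x[(k + 1) % NSPECIES] - x[k]);   // placeholder update rule
        for (int k = 0; k < NSPECIES; ++k) state[s * NSPECIES + k] = x[k];
    }

    __global__ void thin_kernel(const float* in, float* out, int n_sims) {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per (simulation, species)
        if (idx >= n_sims * NSPECIES) return;
        int s = idx / NSPECIES, k = idx % NSPECIES;
        float xk  = in[s * NSPECIES + k];
        float xk1 = in[s * NSPECIES + (k + 1) % NSPECIES];
        out[idx] = xk + 0.01f * (xk1 - xk);         // one step; the host relaunches per step
    }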
Tsunami: Massively Parallel Homomorphic Hashing on Many-core GPUs
"... Homomorphic hash functions (HHF) play a key role in securing distributed systems that use coding techniques such as erasure coding and network coding. The computational complexity of HHFs remains to be a main challenge. In this paper, we present a massively parallel solution, named Tsunami, by explo ..."
of Montgomery multiplication in order to decrease the demand of registers and shared memory and increase the utilization ratio of GPU processing cores; (3) using our own assembly code to implement the 32-bit integer multiplication, which outperforms the assembly codes generated by the native compiler by 20%; (4
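For reference, here is what a plain CUDA C++ version of 32-bit Montgomery multiplication looks like (the paper hand-tunes the multiply in assembly, none of which is reproduced here); the helper names and the toy modular-exponentiation kernel are assumptions, shown only to make the arithmetic concrete.

    // Computes a * b * R^{-1} mod n with R = 2^32, for odd n with 1 < n < 2^32 and a, b < n.
    #include <cstdint>

    __host__ __device__ inline uint32_t neg_inv32(uint32_t n) {
        // -n^{-1} mod 2^32 via Newton iteration (n must be odd).
        uint32_t x = n;                             // correct to 3 bits for odd n
        for (int i = 0; i < 4; ++i) x *= 2u - n * x;
        return (uint32_t)(0u - x);
    }

    __host__ __device__ inline uint32_t mont_mul(uint32_t a, uint32_t b,
                                                 uint32_t n, uint32_t ninv) {
        uint64_t t  = (uint64_t)a * b;
        uint32_t m  = (uint32_t)t * ninv;           // m = t * (-n^{-1}) mod 2^32
        uint64_t mn = (uint64_t)m * n;
        // (t + mn) is divisible by 2^32: add the high halves plus the carry of the low halves.
        uint64_t u  = (t >> 32) + (mn >> 32) + ((uint32_t)t != 0u);
        if (u >= n) u -= n;
        return (uint32_t)u;
    }

    __global__ void modexp_kernel(uint32_t base, uint32_t exp, uint32_t n, uint32_t* out) {
        // Toy use: modular exponentiation performed entirely in Montgomery form.
        uint32_t ninv  = neg_inv32(n);
        uint32_t rmodn = (uint32_t)(((uint64_t)1 << 32) % n);       // 1 in Montgomery form
        uint32_t x     = (uint32_t)(((uint64_t)base << 32) % n);    // base in Montgomery form
        uint32_t acc   = rmodn;
        for (; exp; exp >>= 1) {
            if (exp & 1) acc = mont_mul(acc, x, n, ninv);
            x = mont_mul(x, x, n, ninv);
        }
        *out = mont_mul(acc, 1u, n, ninv);          // convert back out of Montgomery form
    }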