Results 1 - 10 of 39

DMA-Assisted, Intranode Communication in GPU Accelerated Systems

by Feng Ji, Ashwin M. Aji, James Dinan, Darius Buntinas, Pavan Balaji, Rajeev Thakur, Wu-chun Feng, Xiaosong Ma
"... Accelerator awareness has become a pressing issue in data movement models, such as MPI, because of the rapid deployment of systems that utilize accelerators. In our previous work, we developed techniques to enhance MPI with accelerator awareness, thus allowing applications to easily and efficiently communicate data between accelerator memories. In this paper, we extend this work with techniques to perform efficient data movement between accelerators within the same node using a DMA-assisted, peer-to-peer intranode communication technique that was recently introduced for NVIDIA GPUs. We ..."
Cited by 1 (1 self)

Elastic Pipeline: Addressing GPU On-chip Shared Memory Bank Conflicts

by Chunyang Gou, Georgi N. Gaydadjiev
"... One of the major problems with the GPU on-chip shared memory is bank conflicts. We observed that the throughput of the GPU processor core is often constrained neither by the shared memory bandwidth, nor by the shared memory latency (as long as it stays constant), but is rather due to the varied lat ..."
Cited by 1 (1 self)
"... on-chip memory bank conflicts on system throughput, by decoupling bank conflicts from pipeline stalls. Simulation results show that our proposed elastic pipeline together with the co-designed bank-conflict aware warp scheduling reduces the pipeline stalls by up to 64.0% (with 42.3% on average) and improves ..."

Simultaneous Branch and Warp Interweaving for Sustained GPU Performance

by Nicolas Brunie, Sylvain Collange, Gregory Diamos - in 39th Annual International Symposium on Computer Architecture (ISCA), 2012
"... Single-Instruction Multiple-Thread (SIMT) microarchitectures implemented in Graphics Processing Units (GPUs) run fine-grained threads in lockstep by grouping them into units, referred to as warps, to amortize the cost of instruction fetch, decode and control logic over multiple execution units. As i ..."
Cited by 12 (0 self)
"... introducing extra memory divergence. We consider (1) co-issuing instructions from different divergent paths of the same warp and (2) co-issuing instructions from different warps. To support (1), we introduce a novel thread reconvergence technique that ensures threads are run back in lockstep at control-flow ..."

Kernelet: High-Throughput GPU Kernel Executions with Dynamic Slicing and Scheduling

by Jianlong Zhong, Bingsheng He
"... Graphics processors, or GPUs, have recently been widely used as accelerators in shared environments such as clusters and clouds. In such shared environments, many kernels are submitted to GPUs from different users, and throughput is an important metric for performance and total ownership co ..."
transparent memory management and PCI-e data transfer techniques, and dynamic slicing and scheduling techniques for kernel executions. With slicing, Kernelet divides a GPU kernel into multiple sub-kernels (namely slices). Each slice has tunable occupancy to allow co-scheduling with other slices for high GPU

Branch and data herding: Reducing control and memory divergence for error-tolerant GPU applications

by John Sartori, Rakesh Kumar - IEEE Trans. Multimedia
"... Control and memory divergence between threads within the same execution bundle, or warp, have been shown to cause significant performance bottlenecks for GPU applications. In this paper, we exploit the observation that many GPU applications exhibit error tolerance to propose branch and data ..."
Cited by 13 (2 self)

Real-Time GPU-Based 3D Ultrasound Reconstruction and Visualization

by Holger Ludvigsen (supervisor: Anne Cathrine Elster; co-supervisor: Frank Lindseth), 2010
"... 3D ultrasound reconstruction can be used to generate volume data from tracked real-time 2D ultrasound frames. Compared to other imaging modalities like MRI and CT, ultrasound is a flexible low-cost solution for generating 3D image maps of the internal organs of the human body using existing 2D ultrasound scanners. This makes ultrasound the modality of choice for intraoperative use and enables image guided surgery (e.g. neuro- or laparoscopic) where surgical instruments are safely navigated inside the human body. Current CPU-based methods for 3D ultrasound reconstruction are time consuming ..."

Enabling and Exploiting Flexible Task Assignment on GPU through SM-Centric Program Transformations

by unknown authors
"... A GPU’s computing power lies in its abundant memory bandwidth and massive parallelism. However, its hardware thread schedulers, despite being able to quickly distribute computation to processors, often fail to capitalize on program characteristics effectively, achieving only a fraction of the GPU’s ..."

Addressing GPU On-Chip Shared Memory Bank Conflicts Using Elastic Pipeline

by Chunyang Gou, Georgi N. Gaydadjiev - Int J Parallel Prog, DOI 10.1007/s10766-012-0201-1, 2011
"... © The Author(s) 2012. This article is published with open access at Springerlink.com. One of the major problems with the GPU on-chip shared memory is bank conflicts. We analyze that the throughput of the GPU processor core is often constrained neither by the shared memory bandwidth, nor by ..."

RFVP: Rollback-Free Value Prediction with Safe-to-Approximate Loads

by Amir Yazdanbakhsh, Gennady Pekhimenko, Bradley Thwaites, Hadi Esmaeilzadeh, Taesoo Kim, Onur Mutlu, Todd C. Mowry
"... This paper aims to tackle two fundamental memory bottlenecks: limited off-chip bandwidth (bandwidth wall) and long access latency (memory wall). To achieve this goal, our approach exploits the inherent error resilience of a wide range of applications. We introduce an approximation technique, calle ..."

A Novel Architecture for Resource Management in Active Networks Using a Directory Service

by Fariza Sabrina, Sanjay Jha - In Proceedings of ICT03, 1999
"... This paper presents a framework for resource management in highly dynamic active networks. The goal is to allocate and manage active node resources in an efficient way while ensuring effective utilization of network and supporting load balancing. The framework supports co-existence of active and non ..."
Cited by 3 (0 self)
are facilitated through the DS, while within an active node the framework implements the composite scheduling algorithm to schedule memory, CPU and bandwidth to solve the combined resource scheduling problems. In addition, a flexible active node Knowledge base system has been introduced in order to resolve