Results 1 -
5 of
5
VOCL: An Optimized Environment for Transparent Virtualization of Graphics Processing Units
- In Proc. of the 1st Innovative Parallel Computing (InPar
, 2012
"... Graphics processing units (GPUs) have been widely used for general-purpose computation acceleration. However, current programming models such as CUDA and OpenCL can support GPUs only on the local computing node, where the application execution is tightly coupled to the physical GPU hardware. In this ..."
Abstract
-
Cited by 2 (2 self)
- Add to MetaCart
Graphics processing units (GPUs) have been widely used for general-purpose computation acceleration. However, current programming models such as CUDA and OpenCL can support GPUs only on the local computing node, where the application execution is tightly coupled to the physical GPU hardware. In this work, we propose a virtual OpenCL (VOCL) framework to support the transparent utilization of local or remote GPUs. This framework, based on the OpenCL programming model, exposes physical GPUs as decoupledvirtualresourcesthatcanbetransparentlymanaged independent of the application execution. The proposed framework requires no source code modifications. We also propose various strategies for reducing the overhead caused by data communication and kernel launching and demonstrate about 85 % of the data write bandwidth and 90 % of the data read bandwidth compared to data write and read, respectively, in a native nonvirtualized environment. We evaluate the performance of VOCL using four real-world applications with various computation and memory access intensities and demonstrate that compute-intensive applications can execute with negligible overhead in the VOCL environment.
High-Performance Message Passing over generic Ethernet Hardware with Open-MX
, 2010
"... In the last decade, cluster computing has become the most popular high-performance computing architecture. Although numerous technological innovations have been proposed to improve the interconnection of nodes, many clusters still rely on commodity Ethernet hardware to implement message passing with ..."
Abstract
-
Cited by 2 (0 self)
- Add to MetaCart
In the last decade, cluster computing has become the most popular high-performance computing architecture. Although numerous technological innovations have been proposed to improve the interconnection of nodes, many clusters still rely on commodity Ethernet hardware to implement message passing within parallel applications. We present Open-MX, an open-source message passing stack over generic Ethernet. It offers the same abilities as the specialized Myrinet Express stack, without requiring dedicated support from the networking hardware. Open-MX works transparently in the most popular MPI implementations through its MX interface compatibility. It also enables interoperability between hosts running the specialized MX stack and generic Ethernet hosts. We detail how Open-MX copes with the inherent limitations of the Ethernet hardware to satisfy the requirements of message passing by applying an innovative copy offload model. Combined with a careful tuning of the fabric and of the MX wire protocol, Open-MX achieves better performance than TCP implementations, especially on 10 gigabit/s hardware.
Non-Data-Communication Overheads in MPI: Analysis on Blue Gene/P
"... high-performance by using the parallelism of a massive number of low-frequency/low-power processing cores. This means that the local preand post-communication processing required by the MPI stack might not be very fast, owing to the slow processing cores. Similarly, small amounts of serialization wi ..."
Abstract
-
Cited by 1 (0 self)
- Add to MetaCart
high-performance by using the parallelism of a massive number of low-frequency/low-power processing cores. This means that the local preand post-communication processing required by the MPI stack might not be very fast, owing to the slow processing cores. Similarly, small amounts of serialization within the MPI stack that were acceptable on small/medium systems can be brutal on massively parallel systems. In this paper, we study different non-data-communication overheads within the MPI implementation on the IBM Blue Gene/P system. 1
Generalizing the Utility of GPUs in Large-Scale Heterogeneous Computing Systems
"... Abstract—Graphics processing units (GPUs) have been widely used as accelerators in large-scale heterogeneous computing systems. However, current programming models can only support the utilization of local GPUs. When using non-local GPUs, programmers need to explicitly call API functions for data co ..."
Abstract
- Add to MetaCart
Abstract—Graphics processing units (GPUs) have been widely used as accelerators in large-scale heterogeneous computing systems. However, current programming models can only support the utilization of local GPUs. When using non-local GPUs, programmers need to explicitly call API functions for data communication across computing nodes. As such, programming GPUs in large-scale computing systems is more challenging than local GPUs since local and remote GPUs have to be dealt with separately. In this work, we propose a virtual OpenCL (VOCL) framework to support the transparent virtualization of GPUs. This framework, based on the OpenCL programming model, exposes physical GPUs as decoupled virtual resources that can be transparently managed independent of the application execution. To reduce the virtualization overhead, we optimize the GPU memory accesses and kernel launches. We also extend the VOCL framework to support live task migration across physical GPUs to achieve load balance and/or quick system maintenance. Our experiment results indicate that VOCL can greatly simplify the task of programming cluster-based GPUs at a reasonable virtualization cost. Keywords-graphics processing unit (GPU), virtual OpenCL, task migration I.

