Gdev: First-class GPU resource management in the operating system
In Proc. of the USENIX Annual Technical Conference, 2012
"... Graphics processing units (GPUs) have become a very powerful platform embracing a concept of heterogeneous many-core computing. However, application domains of GPUs are currently limited to specific systems, largely due to a lack of “first-class ” GPU resource management for general-purpose multi-ta ..."
Abstract
-
Cited by 24 (4 self)
Graphics processing units (GPUs) have become a very powerful platform embracing a concept of heterogeneous many-core computing. However, application domains of GPUs are currently limited to specific systems, largely due to a lack of “first-class” GPU resource management for general-purpose multi-tasking systems. We present Gdev, a new ecosystem of GPU resource management in the operating system (OS). It allows the user space as well as the OS itself to use GPUs as first-class computing resources. Specifically, Gdev’s virtual memory manager supports data swapping for excessive memory resource demands, and also provides a shared device memory functionality that allows GPU contexts to communicate with other contexts. Gdev further provides a GPU scheduling scheme to virtualize a physical GPU into multiple logical GPUs, enhancing isolation among working sets of multi-tasking systems. Our evaluation conducted on Linux and the NVIDIA GPU shows that the basic performance of our prototype implementation is reliable even compared to proprietary software. Further detailed experiments demonstrate that Gdev achieves a 2x speedup for an encrypted file system using the GPU in the OS. Gdev can also improve the makespan of dataflow programs by up to 49% by exploiting shared device memory, while keeping the utilization error of virtualized GPUs within only 7%.
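The shared device memory functionality described above lets two GPU contexts exchange data without staging through the host. Gdev implements this inside the OS; as a user-space point of comparison, stock CUDA's IPC memory handles achieve a similar cross-context mapping. A minimal producer-side sketch using that real CUDA API (piping the handle to the consumer is an assumption for illustration, not Gdev's mechanism):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    float *d_buf = nullptr;
    cudaMalloc(&d_buf, 64 << 20);            // 64 MB device segment

    cudaIpcMemHandle_t handle;
    cudaIpcGetMemHandle(&handle, d_buf);     // exportable handle to the allocation

    // A consumer process would map the same device memory with:
    //   void *p; cudaIpcOpenMemHandle(&p, handle, cudaIpcMemLazyEnablePeerAccess);
    // and launch kernels on p without copying through host memory.
    fwrite(&handle, sizeof(handle), 1, stdout);  // e.g. hand off via a pipe
    // In real use, free only after the consumer calls cudaIpcCloseMemHandle().
    cudaFree(d_buf);
    return 0;
}
```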
GPUfs: Integrating a File System with GPUs
"... As GPU hardware becomes increasingly general-purpose, it is quickly outgrowing the traditional, constrained GPU-as-coprocessor programming model. To make GPUs easier to program and improve their integration with operating systems, we propose making the host’s file system directly accessible to GPU c ..."
Abstract
-
Cited by 13 (2 self)
As GPU hardware becomes increasingly general-purpose, it is quickly outgrowing the traditional, constrained GPU-as-coprocessor programming model. To make GPUs easier to program and improve their integration with operating systems, we propose making the host’s file system directly accessible to GPU code. GPUfs provides a POSIX-like API for GPU programs, exploits GPU parallelism for efficiency, and optimizes GPU file access by extending the host CPU’s buffer cache into GPU memory. Our experiments, based on a set of real benchmarks adapted to use our file system, demonstrate the feasibility and benefits of the GPUfs approach. For example, a self-contained GPU program that searches for a set of strings throughout the Linux kernel source tree runs over seven times faster than on an eight-core CPU.
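The POSIX-like API mentioned in the abstract means file calls are issued from inside GPU kernels. A sketch of the shape of such code; the gopen/gread/gclose names follow the GPUfs paper, but the declarations below are stand-in assumptions (a real program would include the GPUfs headers and link its library):

```cuda
#include <cstddef>

// Assumed stand-ins for GPUfs library declarations:
#define O_GRDONLY 0
extern __device__ int  gopen(const char *path, int flags);
extern __device__ long gread(int fd, void *buf, size_t n, size_t off);
extern __device__ int  gclose(int fd);

__global__ void count_matches(const char *needle, int *hits) {
    // All threadblocks open the same file; the library can coalesce
    // this into a single shared open with one descriptor.
    int fd = gopen("/usr/src/linux/mm/memory.c", O_GRDONLY);

    char buf[256];
    size_t off = (size_t)blockIdx.x * sizeof(buf);
    long n = gread(fd, buf, sizeof(buf), off);  // each block reads its slice

    // ... scan the n bytes in buf for needle; atomicAdd(hits, found) ...
    (void)needle; (void)n; (void)hits;

    gclose(fd);
}
```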
Shredder: GPU-Accelerated Incremental Storage and Computation
"... Redundancy elimination using data deduplication and incremental data processing has emerged as an important technique to minimize storage and computation requirements in data center computing. In this paper, we present the design, implementation and evaluation of Shredder, a high performance content ..."
Abstract
-
Cited by 9 (2 self)
Redundancy elimination using data deduplication and incremental data processing has emerged as an important technique to minimize storage and computation requirements in data center computing. In this paper, we present the design, implementation, and evaluation of Shredder, a high-performance content-based chunking framework for supporting incremental storage and computation systems. Shredder exploits the massively parallel processing power of GPUs to overcome the CPU bottlenecks of content-based chunking in a cost-effective manner. Unlike previous uses of GPUs, which have focused on applications where computation costs are dominant, Shredder is designed to operate in both compute- and data-intensive environments. To allow this, Shredder provides several novel optimizations aimed at reducing the cost of transferring data between host (CPU) and GPU, fully utilizing the multicore architecture at the host, and reducing GPU memory access latencies. With our optimizations, Shredder achieves a speedup of over 5x in chunking bandwidth compared to our optimized parallel implementation without a GPU on the same host system. Furthermore, we present two real-world applications of Shredder: an extension to HDFS, which serves as a basis for incremental MapReduce computations, and an incremental cloud backup system. In both contexts, Shredder detects redundancies in the input data across successive runs, leading to significant savings in storage, computation, and end-to-end completion times.
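Content-based chunking, the CPU bottleneck Shredder offloads, slides a rolling hash over the stream and declares a chunk boundary whenever the low bits of the hash match a magic value, so boundaries survive insertions and deletions. A generic host-side sketch of the technique (a simple polynomial rolling hash, not Shredder's GPU kernels or its exact fingerprint function):

```cuda
#include <cstdint>
#include <cstdio>
#include <vector>

std::vector<size_t> chunk_boundaries(const uint8_t *data, size_t len) {
    const size_t   WIN  = 48;              // rolling window, bytes
    const uint64_t MASK = (1u << 13) - 1;  // => ~8 KB average chunks
    uint64_t pow_win = 1;                  // 257^WIN mod 2^64
    for (size_t i = 0; i < WIN; i++) pow_win *= 257;

    std::vector<size_t> cuts;
    uint64_t h = 0;
    for (size_t i = 0; i < len; i++) {
        h = h * 257 + data[i];                         // push new byte
        if (i >= WIN) h -= pow_win * data[i - WIN];    // pop oldest byte
        if ((h & MASK) == MASK) cuts.push_back(i + 1); // content-defined cut
    }
    return cuts;
}

int main() {
    uint8_t buf[1 << 16];
    for (size_t i = 0; i < sizeof(buf); i++)
        buf[i] = (uint8_t)((i * 2654435761u) >> 24);   // pseudo-random input
    std::vector<size_t> cuts = chunk_boundaries(buf, sizeof(buf));
    printf("%zu boundaries\n", cuts.size());
    return 0;
}
```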
Fast and Flexible: Parallel Packet Processing with GPUs and Click
"... We introduce Snap, a framework for packet processing that out-performs traditional software routers by exploiting the parallelism available on modern GPUs. While obtaining high performance, it remains extremely flexible, with packet processing tasks imple-mented as simple modular elements that are c ..."
Abstract
-
Cited by 9 (1 self)
We introduce Snap, a framework for packet processing that outperforms traditional software routers by exploiting the parallelism available on modern GPUs. While obtaining high performance, it remains extremely flexible, with packet processing tasks implemented as simple modular elements that are composed to build fully functional routers and switches. Snap is based on the Click modular router, which it extends by adding new architectural features that support batched packet processing, memory structures optimized for offloading to coprocessors, and asynchronous scheduling with in-order completion. We show that Snap can run complex pipelines at high speeds on commodity PC hardware by building an IP router incorporating both an IDS-like full-packet string matcher and an SDN-like packet classifier. In this configuration, Snap is able to forward 40 million packets per second, saturating four 10 Gbps NICs at packet sizes as small as 128 bytes. This represents an increase in throughput of nearly 4x over the baseline Click running comparable elements on the CPU.
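Click's defining abstraction, which Snap inherits, is the processing element composed into a graph; Snap's twist is that elements pass batches rather than single packets so they can feed GPU offload elements. A minimal host-side sketch of that composition idea (hypothetical types, not Snap's actual classes):

```cuda
#include <cstdint>
#include <cstdio>
#include <vector>

struct Packet { const uint8_t *data; size_t len; };
using Batch = std::vector<Packet>;

struct Element {                        // Click-style processing element
    Element *next = nullptr;
    virtual void push(Batch &b) { if (next) next->push(b); }
    virtual ~Element() = default;
};

struct Classifier : Element {           // e.g. an SDN-like header classifier
    void push(Batch &b) override {
        // ... partition b by header fields; a GPU offload element would
        // enqueue the batch asynchronously here and emit completed
        // batches downstream in order ...
        Element::push(b);
    }
};

int main() {
    Classifier cls;
    Element sink;
    cls.next = &sink;                   // compose: classifier -> sink
    Batch b(64);                        // a 64-packet batch
    cls.push(b);
    puts("batch forwarded");
    return 0;
}
```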
GPUSync: A Framework for Real-Time GPU Management
"... The integration of graphics processing units (GPUs) into real-time systems has recently become an active area of research. However, prior research on this topic has failed to produce real-time GPU allocation methods that fully exploit the available parallelism in GPU-enabled systems. In this paper, ..."
Abstract
-
Cited by 8 (5 self)
The integration of graphics processing units (GPUs) into real-time systems has recently become an active area of research. However, prior research on this topic has failed to produce real-time GPU allocation methods that fully exploit the available parallelism in GPU-enabled systems. In this paper, a GPU management framework called GPUSync is described that enables increased parallelism while providing predictable real-time behavior. GPUSync can be applied in multi-GPU real-time systems, is cognizant of the system bus architecture, and fully exposes the parallelism offered by modern GPUs, even when closed-source GPU drivers are used. Schedulability tests presented herein, which incorporate empirically measured overheads (including those due to bus contention), demonstrate that GPUSync offers real-time predictability and performs well in practice.
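GPUSync's central idea is to treat each GPU engine (copy engines, execution engine) as a separately lockable resource governed by real-time locking protocols, so transfers and kernels from different tasks can overlap. GPUSync implements k-exclusion protocols inside LITMUS^RT; as a loose user-space analogy only, the sketch below guards two engines with POSIX priority-inheritance mutexes:

```cuda
#include <pthread.h>

static pthread_mutex_t copy_engine, exec_engine;

static void init_engine_lock(pthread_mutex_t *m) {
    pthread_mutexattr_t a;
    pthread_mutexattr_init(&a);
    // Priority inheritance bounds priority inversion when a high-priority
    // task waits on an engine that a low-priority task currently holds.
    pthread_mutexattr_setprotocol(&a, PTHREAD_PRIO_INHERIT);
    pthread_mutex_init(m, &a);
}

int main() {
    init_engine_lock(&copy_engine);
    init_engine_lock(&exec_engine);

    pthread_mutex_lock(&copy_engine);   // hold only while the DMA is in flight
    /* ... cudaMemcpyAsync + wait ... */
    pthread_mutex_unlock(&copy_engine); // another task's kernel may already
                                        // be running under exec_engine
    pthread_mutex_lock(&exec_engine);   // hold only while the kernel runs
    /* ... kernel launch + wait ... */
    pthread_mutex_unlock(&exec_engine);
    return 0;
}
```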
Dandelion: a compiler and runtime for heterogeneous systems
In Proc. of the Twenty-Fourth ACM Symposium on Operating Systems Principles. ACM, 2013
"... Computer systems increasingly rely on heterogeneity to achieve greater performance, scalability and en-ergy efficiency. Because heterogeneous systems typi-cally comprise multiple execution contexts with differ-ent programming abstractions and runtimes, program-ming them remains extremely challenging ..."
Abstract
-
Cited by 8 (0 self)
Computer systems increasingly rely on heterogeneity to achieve greater performance, scalability and energy efficiency. Because heterogeneous systems typically comprise multiple execution contexts with different programming abstractions and runtimes, programming them remains extremely challenging. Dandelion is a system designed to address this programmability challenge for data-parallel applications. Dandelion provides a unified programming model for heterogeneous systems that span diverse execution contexts including CPUs, GPUs, FPGAs, and the cloud. It adopts the .NET LINQ (Language INtegrated Query) approach, integrating data-parallel operators into general-purpose programming languages such as C# and F#. It therefore provides an expressive data model and native language integration for user-defined functions, enabling programmers to write applications using standard high-level languages and development tools. Dandelion automatically and transparently distributes data-parallel portions of a program to available computing resources, including compute clusters for distributed execution and CPU and GPU cores of individual nodes for parallel execution. To enable automatic execution of .NET code on GPUs, Dandelion cross-compiles .NET code to CUDA kernels and uses the PTask runtime [85] to manage GPU execution. This paper discusses the design and implementation of Dandelion, focusing on the distributed CPU and GPU implementation. We evaluate the system using a diverse set of workloads.
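The cross-compilation path in the abstract turns LINQ operators into CUDA kernels. As an illustration, a query step like xs.Select(x => x * x + 1.0f) could plausibly lower to the elementwise kernel below (the lowering is a hand-written assumption; only the operator-to-kernel idea comes from the abstract):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// xs.Select(x => x * x + 1.0f) lowered to an elementwise kernel:
__global__ void select_kernel(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * in[i] + 1.0f;  // user-defined lambda body
}

int main() {
    const int n = 1 << 20;
    float *in, *out;
    cudaMallocManaged(&in,  n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; i++) in[i] = (float)i;

    select_kernel<<<(n + 255) / 256, 256>>>(in, out, n);
    cudaDeviceSynchronize();
    printf("out[2] = %f\n", out[2]);           // 5.0
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```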
Zero-copy I/O processing for low-latency GPU computing
In Proceedings of the 4th ACM/IEEE International Conference on Cyber-Physical Systems (ICCPS'13), 2013
"... Cyber-physical systems (CPS) aim to monitor and control complex real-world phenomena where the computational cost and real-time constraints could be a major challenge. Many-core hardware accelerators such as graphics processing units (GPUs) promise to enhancing computation, leveraging the data paral ..."
Abstract
-
Cited by 7 (2 self)
Cyber-physical systems (CPS) aim to monitor and control complex real-world phenomena where the computational cost and real-time constraints could be a major challenge. Many-core hardware accelerators such as graphics processing units (GPUs) promise to enhance computation, leveraging the data parallelism often found in real-world scenarios of CPS, but performance is limited by the overhead of the data transfer between the host and the device memory. For example, plasma control in the HBT-EP Tokamak device at Columbia University [11, 18] must execute the control algorithm in a few microseconds, but may take tens of microseconds to copy the data set between the host and the device memory. This paper presents a zero-copy I/O processing scheme that maps the I/O address space of the system to the virtual address space of the compute device, allowing sensors and actuators to transfer data to and from the compute device directly. Experiments using the plasma control system show a 33% reduction in computational cost, and microbenchmarks with more generic matrix operations show a 34% reduction, while in both cases effective data throughput remains at least as good as the current best performers.
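The paper's scheme maps the system's I/O address space into the GPU's virtual address space; the closest stock-CUDA analogue is mapped (zero-copy) pinned host memory, where kernels dereference host buffers directly over the bus instead of waiting on an explicit cudaMemcpy. A minimal sketch using that real API:

```cuda
#include <cuda_runtime.h>
#include <cstdio>

__global__ void control_step(const float *sensors, float *actuators, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) actuators[i] = -0.5f * sensors[i];   // toy control law
}

int main() {
    cudaSetDeviceFlags(cudaDeviceMapHost);          // enable mapped memory
    const int n = 256;
    float *h_in, *h_out, *d_in, *d_out;
    // Mapped pinned allocations: visible to the GPU without copies.
    cudaHostAlloc((void **)&h_in,  n * sizeof(float), cudaHostAllocMapped);
    cudaHostAlloc((void **)&h_out, n * sizeof(float), cudaHostAllocMapped);
    cudaHostGetDevicePointer((void **)&d_in,  h_in,  0);
    cudaHostGetDevicePointer((void **)&d_out, h_out, 0);

    for (int i = 0; i < n; i++) h_in[i] = 1.0f;     // "sensor" data
    control_step<<<1, n>>>(d_in, d_out, n);
    cudaDeviceSynchronize();                        // results already in h_out
    printf("actuator[0] = %f\n", h_out[0]);
    cudaFreeHost(h_in);
    cudaFreeHost(h_out);
    return 0;
}
```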
A Scalable Framework for Heterogeneous GPU-Based Clusters
2012
"... GPU-based heterogeneous clusters continue to draw attention from vendors and HPC users due to their high energy efficiency and much improved single-node computational performance, however, there is few parallel software that can utilize all CPU cores and all GPUs on the heterogeneous system efficien ..."
Abstract
-
Cited by 7 (0 self)
GPU-based heterogeneous clusters continue to draw attention from vendors and HPC users due to their high energy efficiency and much-improved single-node computational performance; however, there is little parallel software that can utilize all CPU cores and all GPUs on such heterogeneous systems efficiently. On a heterogeneous cluster, the performance of a GPU (or a compute node) increases at a much faster rate than the performance of the PCI-Express connection (or the interconnection network), such that communication eventually becomes the bottleneck of the entire system. To overcome the bottleneck, we developed a multilevel partitioning and distribution method that guarantees a near-optimal communication volume. We have also extended heterogeneous tile algorithms to work on distributed-memory GPU clusters. Our main idea is to execute a serial program and generate hybrid-size tasks, and to follow a dataflow programming model to fire the tasks on different compute nodes. We devised a distributed dynamic scheduling runtime system to schedule tasks and transfer data between hybrid CPU-GPU compute nodes transparently. The runtime system employs a novel distributed task-assignment protocol to solve data dependencies between tasks without coordination between processing units. The runtime system on each node consists of a number of CPU compute threads, a number of GPU compute threads, a task generation thread, and an MPI communication thread.
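A task-assignment protocol that needs no coordination is possible when ownership is a pure function of task coordinates: every node evaluates the same rule locally and fires only its own tasks, so no messages are needed to agree on the assignment. The sketch below shows that idea with a standard 2D block-cyclic tile distribution (the paper's multi-level protocol is richer than this):

```cuda
#include <cstdio>

struct Grid { int prows, pcols; };     // process grid, e.g. 2 x 3

// Node that owns (and therefore executes tasks writing) tile (i, j).
int owner_of_tile(Grid g, int i, int j) {
    return (i % g.prows) * g.pcols + (j % g.pcols);
}

int main() {
    Grid g = {2, 3};                   // 6 nodes
    // Each node scans the same task list and fires only its own tasks.
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++)
            printf("tile (%d,%d) -> node %d\n", i, j, owner_of_tile(g, i, j));
    return 0;
}
```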
Design and Evaluation of the GeMTC Framework for GPU-enabled Many-Task Computing
"... We present the design and first performance and usability evaluation of GeMTC, a novel execution model and runtime system that enables accelerators to be programmed with many concurrent and independent tasks of potentially short or variable duration. With GeMTC, a broad class of such “many-task ” ap ..."
Abstract
-
Cited by 7 (5 self)
We present the design and first performance and usability evaluation of GeMTC, a novel execution model and runtime system that enables accelerators to be programmed with many concurrent and independent tasks of potentially short or variable duration. With GeMTC, a broad class of such “many-task” applications can leverage the increasing number of accelerated and hybrid high-end computing systems. GeMTC overcomes the obstacles to using GPUs in a many-task manner by scheduling and launching independent tasks on hardware designed for SIMD-style vector processing. We demonstrate the use of a high-level MTC programming model (the Swift parallel dataflow language) to run tasks on many accelerators and thus provide a high-productivity programming model for the growing number of supercomputers that are accelerator-enabled. While still in an experimental stage, GeMTC can already support tasks of fine (subsecond) granularity and execute concurrent heterogeneous tasks on 86,000 independent GPU warps spanning 2.7M GPU threads on the Blue Waters supercomputer.
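Scheduling many short tasks onto warps, as GeMTC does, is typically built on persistent workers: a resident kernel whose warps repeatedly claim task descriptors from a queue instead of returning after one task. A generic sketch of that pattern (not GeMTC's actual scheduler):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

struct Task { int op; float arg; float result; };

__global__ void worker(Task *queue, int n_tasks, int *next) {
    int lane = threadIdx.x % 32;
    int warp = threadIdx.x / 32;
    __shared__ int claimed[32];        // one claim slot per warp in the block

    while (true) {
        if (lane == 0)                 // lane 0 grabs the next task index
            claimed[warp] = atomicAdd(next, 1);
        __syncwarp();
        int t = claimed[warp];
        if (t >= n_tasks) return;      // queue drained: warp retires
        if (lane == 0)                 // toy task body, one warp per task
            queue[t].result = queue[t].arg * 2.0f;
        __syncwarp();
    }
}

int main() {
    const int n = 1024;
    Task *q; int *next;
    cudaMallocManaged(&q, n * sizeof(Task));
    cudaMallocManaged(&next, sizeof(int));
    for (int i = 0; i < n; i++) q[i] = {0, (float)i, 0.0f};
    *next = 0;
    worker<<<4, 128>>>(q, n, next);    // 16 persistent warps drain 1024 tasks
    cudaDeviceSynchronize();
    printf("task 10 -> %f\n", q[10].result);
    return 0;
}
```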
Robust Real-Time Multiprocessor Interrupt Handling Motivated by GPUs
"... Abstract—Architectures in which multicore chips are augmented with graphics processing units (GPUs) have great potential in many domains in which computationally intensive real-time workloads must be supported. However, unlike standard CPUs, GPUs are treated as I/O devices and require the use of int ..."
Abstract
-
Cited by 6 (3 self)
Architectures in which multicore chips are augmented with graphics processing units (GPUs) have great potential in many domains in which computationally intensive real-time workloads must be supported. However, unlike standard CPUs, GPUs are treated as I/O devices and require the use of interrupts to facilitate communication with CPUs. Given their disruptive nature, interrupts must be dealt with carefully in real-time systems. With GPU-driven interrupts, such disruptiveness is further compounded by the closed-source nature of GPU drivers. In this paper, such problems are considered and a solution is presented in the form of an extension to LITMUS^RT called klmirqd. The design of klmirqd targets systems with multiple CPUs and GPUs. In such settings, interrupt-related issues arise that have not been previously addressed.
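klmirqd's approach is to convert disruptive interrupt bottom halves into ordinary schedulable threads whose priority the real-time scheduler controls. It lives inside the LITMUS^RT kernel; the sketch below is only a user-space gesture at the same structure, with a SCHED_FIFO thread draining events queued by a simulated top half:

```cuda
#include <pthread.h>
#include <semaphore.h>
#include <sched.h>
#include <cstdio>

static sem_t pending;                  // counts queued "interrupt" events

static void *bottom_half(void *) {
    for (int i = 0; i < 3; i++) {      // handle three events, then exit
        sem_wait(&pending);            // sleep until the top half posts
        // ... process one GPU completion on behalf of its owning task ...
        puts("handled one completion");
    }
    return nullptr;
}

int main() {
    sem_init(&pending, 0, 0);

    // Give the handler thread an explicit real-time priority, so the
    // scheduler (not interrupt arrival order) decides when it runs.
    pthread_attr_t a;
    pthread_attr_init(&a);
    pthread_attr_setinheritsched(&a, PTHREAD_EXPLICIT_SCHED);
    pthread_attr_setschedpolicy(&a, SCHED_FIFO);
    sched_param sp; sp.sched_priority = 50;  // would come from the RT analysis
    pthread_attr_setschedparam(&a, &sp);

    pthread_t t;
    if (pthread_create(&t, &a, bottom_half, nullptr) != 0)
        pthread_create(&t, nullptr, bottom_half, nullptr); // no RT privilege

    for (int i = 0; i < 3; i++)
        sem_post(&pending);            // "top half": signal and return fast
    pthread_join(t, nullptr);
    return 0;
}
```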