Results 1 - 9 of 9
Static Analysis and Compiler Design for Idempotent Processing
"... Recovery functionality has many applications in computing systems, from speculation recovery in modern microprocessors to fault recovery in high-reliability systems. Modern systems commonly recover using checkpoints. However, checkpoints introduce overheads, add complexity, and often save more state ..."
Abstract
-
Cited by 5 (3 self)
- Add to MetaCart
(Show Context)
Recovery functionality has many applications in computing systems, from speculation recovery in modern microprocessors to fault recovery in high-reliability systems. Modern systems commonly recover using checkpoints. However, checkpoints introduce overheads, add complexity, and often save more state than necessary. This paper develops a novel compiler technique to recover program state without the overheads of explicit checkpoints. The technique breaks programs into idempotent regions—regions that can be freely re-executed—which allows recovery without checkpointed state. Leveraging the property of idempotence, recovery can be obtained by simple re-execution. We develop static analysis techniques to construct these regions and demonstrate low overheads and large region sizes for an LLVM-based implementation. Across a set of diverse benchmark suites, we construct idempotent regions close in size to those that could be obtained with perfect runtime information. Although the resulting code runs more slowly, typical performance overheads are in the range of just 2-12%. The paradigm of executing entire programs as a series of idempotent regions we call idempotent processing, and it has many applications in computer systems. As a concrete example, we demonstrate it applied to the problem of compiler-automated hardware fault recovery. In comparison to two other state-of-the-art techniques, redundant execution and checkpoint-logging, our idempotent processing technique outperforms both by over 15%.
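Below is a minimal sketch of the recovery-by-re-execution idea described in this abstract, written as plain C++ rather than compiler-generated code; fault_detected() is a hypothetical stand-in for a hardware detection mechanism, and none of the names come from the paper.

// Hedged sketch: recovery by simple re-execution of an idempotent region.
// fault_detected() is a hypothetical placeholder for a fault-detection signal.

// Idempotent region: it only reads its inputs and writes local state, so
// re-executing it from the start always produces the same result.
long dot(const long* a, const long* b, int n) {
    long acc = 0;                        // re-initialized on every (re-)execution
    for (int i = 0; i < n; ++i) acc += a[i] * b[i];
    return acc;
}

bool fault_detected() { return false; }  // placeholder; a real system checks hardware state

long dot_with_recovery(const long* a, const long* b, int n) {
    long result;
    do {
        result = dot(a, b, n);           // execute the region
    } while (fault_detected());          // on a detected fault, just run the region again
    return result;
}

The key property is that the region never overwrites its own live inputs, so running it again after a fault is always safe and no checkpointed state is needed.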
Idempotent Code Generation: Implementation, Analysis, and Evaluation
"... Leveraging idempotence for efficient recovery is of emerging interest in compiler design. In particular, identifying semantically idempotent code and then compiling such code to preserve the semantic idempotence property enables recovery with substantially lower overheads than competing software tec ..."
Abstract
-
Cited by 4 (1 self)
- Add to MetaCart
(Show Context)
Leveraging idempotence for efficient recovery is of emerging interest in compiler design. In particular, identifying semantically idempotent code and then compiling such code to preserve the semantic idempotence property enables recovery with substantially lower overheads than competing software techniques. However, the efficacy of this technique depends on application-, architecture-, and compiler-specific factors that are not well understood. In this paper, we develop algorithms for the code generation of idempotent code regions and evaluate these algorithms considering how they are impacted by these factors. Without optimizing for these factors, we find that typical performance overheads fall in the range of roughly 10-15%. However, manipulating application idempotent region size typically improves the run-time performance of compiled code by 2-10%, differences in the architecture instruction set affect performance by up to 15%, and knowing in the compiler whether control-flow side effects can or cannot occur can impact performance by up to 10%. Overall, we find that, with small idempotent regions and careful architecture- and application-specific tuning, it is possible to bring compiler performance overheads consistently down into the single-digit percentage range. The absolute best performance occurs when constructing the largest possible idempotent regions; to this end, however, better compiler support is needed. In the interest of spurring development in this area, we open-source our LLVM compiler implementation and make it available as a research tool.
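As a hedged, source-level illustration of the property that idempotent code generation must preserve (the function names are made up, and the paper's algorithms operate on compiler IR, not C++ source):

// Not idempotent: x is a live-in that gets clobbered, so re-executing the
// function after the write would compute a different result.
int scale_in_place(int& x, int k) {
    x = x * k;          // overwrites the live-in before re-execution could reread it
    return x;
}

// Idempotence-preserving form: the live-in x is only read and the result goes
// to separate storage, so the region can be re-executed any number of times.
int scale_to_copy(const int& x, int k, int& out) {
    out = x * k;        // writes only region-local / live-out storage
    return out;
}

The second form preserves idempotence because the live-in value is never overwritten before its uses; keeping that property in generated code, for example by assigning results to fresh registers, is what the code-generation algorithms evaluated in the paper are responsible for.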
Glider: A GPU Library Driver for Improved System Security
, 2014
"... Legacy device drivers implement both device resource man-agement and isolation. This results in a large code base with a wide high-level interface making the driver vulnerable to security attacks. This is particularly problematic for increas-ingly popular accelerators like GPUs that have large, comp ..."
Abstract
-
Cited by 1 (1 self)
- Add to MetaCart
(Show Context)
Legacy device drivers implement both device resource management and isolation. This results in a large code base with a wide high-level interface, making the driver vulnerable to security attacks. This is particularly problematic for increasingly popular accelerators like GPUs that have large, complex drivers. We solve this problem with library drivers, a new driver architecture. A library driver implements resource management as an untrusted library in the application process address space, and implements isolation as a kernel module that is smaller and has a narrower, lower-level interface (i.e., closer to hardware) than a legacy driver. We articulate a set of device and platform hardware properties that are required to retrofit a legacy driver into a library driver. To demonstrate the feasibility and superiority of library drivers, we present Glider, a library driver implementation for two GPUs of popular brands, Radeon and Intel. Glider reduces the TCB size and attack surface by about 35% and 84% respectively for a Radeon HD 6450 GPU and by about 38% and 90% respectively for an Intel Ivy Bridge GPU. Moreover, it incurs no performance cost. Indeed, Glider outperforms a legacy driver for applications requiring intensive interactions with the device driver, such as applications using the OpenGL immediate mode API.
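A hedged sketch of the library-driver split described in this abstract; the interface and method names below are hypothetical and do not correspond to Glider's actual API.

#include <cstddef>
#include <cstdint>

// Small, trusted kernel module: only isolation (channel and IOMMU mappings),
// exposed through a narrow, hardware-level interface.
struct KernelIsolationModule {
    virtual int  map_channel(int pid) = 0;                            // bind a hardware channel to a process
    virtual int  map_iommu(int pid, uint64_t gpu_va, size_t len) = 0; // confine its DMA to its own pages
    virtual void unmap_all(int pid) = 0;
    virtual ~KernelIsolationModule() = default;
};

// Untrusted library in the application's address space: all resource
// management (memory layout, command submission policy) happens here.
class LibraryDriver {
public:
    explicit LibraryDriver(KernelIsolationModule& k) : kernel_(k) {}

    uint64_t alloc(size_t bytes) {                  // allocation policy lives in user space
        uint64_t va = next_va_;
        next_va_ += bytes;
        kernel_.map_iommu(/*pid=*/0, va, bytes);    // the kernel only enforces isolation
        return va;
    }

private:
    KernelIsolationModule& kernel_;
    uint64_t next_va_ = 0x100000;                   // illustrative starting GPU virtual address
};

The design point this sketch captures is that the trusted computing base shrinks to the isolation interface, while the (large, attack-prone) management logic runs untrusted inside the application.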
Leveraging GPUs Using Cooperative Loop Speculation
"... Graphics processing units, or GPUs, provide TFLOPs of additional performance potential in commodity com-puter systems that frequently go unused bymost applications. Even with the emergence of languages such as CUDA and OpenCL, programming GPUs remains a difficult challenge for a variety of reasons, ..."
Abstract
-
Cited by 1 (1 self)
- Add to MetaCart
Graphics processing units, or GPUs, provide TFLOPs of additional performance potential in commodity computer systems that frequently go unused by most applications. Even with the emergence of languages such as CUDA and OpenCL, programming GPUs remains a difficult challenge for a variety of reasons, including the inherent algorithmic characteristics and data structure choices used by applications as well as the tedious performance optimization cycle that is necessary to achieve high performance. The goal of this work is to increase the applicability of GPUs beyond CUDA/OpenCL to implicitly data-parallel applications written in C/C++ using speculative parallelization. To achieve this goal, we propose Paragon: a static/dynamic compiler platform to speculatively run possibly data-parallel portions of sequential applications on the GPU while cooperating with the system CPU. For such loops, Paragon utilizes the GPU in an opportunistic way while orchestrating a cooperative relation between the CPU and GPU to reduce the overhead of mis-speculation. Paragon monitors the dependencies for the loops running speculatively on the GPU and non-speculatively on the CPU using a lightweight distributed conflict detection scheme designed specifically for GPUs, and transfers the execution to the CPU in case a conflict is detected. Paragon resumes the execution on the GPU after the CPU resolves the dependency. Our experiments show that Paragon achieves a 4x speedup on average and up to 30x compared to unsafe CPU execution with four threads, and a 7x speedup on average and up to 64x versus sequential execution, across a set of sequential but implicitly data-parallel applications.
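The commit-or-fallback control flow described in this abstract can be sketched as follows. This is an illustrative, host-only C++ analogue in which one thread stands in for the speculative GPU execution; it is not Paragon's CUDA/OpenCL runtime, and all names are made up.

#include <atomic>
#include <thread>
#include <vector>

// Speculative loop body: out[idx[i]] = in[i] * 2. A conflict is flagged if two
// iterations write the same element (a cross-iteration dependence).
void speculate(std::vector<int>& shadow, std::vector<char>& written,
               const std::vector<int>& in, const std::vector<int>& idx,
               std::atomic<bool>& conflict) {
    for (size_t i = 0; i < in.size(); ++i) {
        if (written[idx[i]]) { conflict = true; return; }  // mis-speculation detected
        written[idx[i]] = 1;
        shadow[idx[i]] = in[i] * 2;                        // speculative writes go to a buffer
    }
}

void run_loop(std::vector<int>& out, const std::vector<int>& in, const std::vector<int>& idx) {
    std::vector<int>  shadow(out.size(), 0);
    std::vector<char> written(out.size(), 0);
    std::atomic<bool> conflict{false};

    std::thread gpu(speculate, std::ref(shadow), std::ref(written),
                    std::cref(in), std::cref(idx), std::ref(conflict));
    gpu.join();

    if (!conflict) { out.swap(shadow); return; }           // no conflict: commit speculative results
    for (size_t i = 0; i < in.size(); ++i)                 // conflict: re-run sequentially, non-speculatively
        out[idx[i]] = in[i] * 2;
}

Because speculative writes are buffered, a detected conflict simply discards the buffer and hands the loop back to the sequential path, mirroring the CPU fallback the abstract describes.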
"... state.edu GPGPUs are evolving from dedicated accelerators towards mainstream commodity computing resources. During the transition, the lack of system management of device memory space on GPGPUs has become a major hurdle. In existing GPGPU systems, device memory space is still managed ex-plicitly by ..."
Abstract
- Add to MetaCart
(Show Context)
GPGPUs are evolving from dedicated accelerators towards mainstream commodity computing resources. During the transition, the lack of system management of device memory space on GPGPUs has become a major hurdle. In existing GPGPU systems, device memory space is still managed explicitly by individual applications, which not only increases the burden on programmers but can also cause application crashes, hangs, or low performance. In this paper, we present the design and implementation of GDM, a fully functional GPGPU device memory manager to address the above problems and unleash the computing power of GPGPUs in general-purpose environments. To effectively coordinate the device memory usage of different applications, GDM takes control over device memory allocations and data transfers to and from device memory, leveraging a buffer allocated in each application's virtual memory. GDM utilizes the unique features of GPGPU systems and relies on several effective optimization techniques to guarantee the efficient usage of device memory space and to achieve high performance. We have evaluated GDM and compared it against state-of-the-art GPGPU system software on a range of workloads. The results show that GDM can prevent applications from crashing, including crashes induced by device memory leaks, and improve system performance by up to 43%.
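A hedged sketch of the host-buffer staging idea described in this abstract: device allocations are backed by a host-side buffer so the manager can evict and reload data when device memory is oversubscribed. Class and method names are illustrative, not GDM's actual interface, and the host/device copies are elided.

#include <cstddef>
#include <map>
#include <vector>

class DeviceMemoryManager {
public:
    explicit DeviceMemoryManager(size_t device_bytes) : free_bytes_(device_bytes) {}

    // Allocate: always reserve a host staging buffer; bind device memory lazily.
    int alloc(size_t bytes) {
        int handle = next_++;
        buffers_[handle] = { std::vector<char>(bytes), /*resident=*/false };
        return handle;
    }

    // Make a buffer resident on the device, evicting others if needed.
    bool make_resident(int handle) {
        Buffer& b = buffers_.at(handle);
        if (b.resident) return true;
        while (free_bytes_ < b.staging.size() && evict_one(handle)) {}
        if (free_bytes_ < b.staging.size()) return false;   // still does not fit
        free_bytes_ -= b.staging.size();
        b.resident = true;            // a real manager would also copy host -> device here
        return true;
    }

private:
    struct Buffer { std::vector<char> staging; bool resident; };

    bool evict_one(int keep) {
        for (auto& [h, b] : buffers_)
            if (h != keep && b.resident) {   // a real manager would copy device -> host first
                b.resident = false;
                free_bytes_ += b.staging.size();
                return true;
            }
        return false;
    }

    std::map<int, Buffer> buffers_;
    size_t free_bytes_;
    int next_ = 0;
};

Because every allocation always has a host backing buffer, the manager can safely evict any resident buffer, which is the mechanism that lets it coordinate multiple applications and absorb oversubscription instead of letting allocations fail.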
Dynamic Orchestration of Massively Data Parallel Execution
, 2014
"... To my family ii ACKNOWLEDGEMENTS I would like to express my deep gratitude to my adviser Prof. Scott Mahlke for his guidance, enthusiastic encouragement and useful critique of this research work. I consider myself truly lucky to have worked with him these past years. He has shown incredible patience ..."
Abstract
- Add to MetaCart
(Show Context)
ACKNOWLEDGEMENTS I would like to express my deep gratitude to my adviser, Prof. Scott Mahlke, for his guidance, enthusiastic encouragement, and useful critique of this research work. I consider myself truly lucky to have worked with him these past years. He has shown incredible patience, served as an excellent mentor, and provided me every opportunity to succeed in this field. I also owe thanks to the remaining members of my dissertation committee, Prof. Mudge, Prof. Dick, and Prof. Wenisch. They all devoted their time to help shape this research into what it has become today. I was lucky to be part of a research group whose members not only assisted me intellectually in my research but were also a comfort during those long nights before each deadline. Amir spent many hours to help me with this work. I cannot imagine having this thesis in its current form without his support. Mojtaba provided significant help with Chapter III of this dissertation. Janghaeng also contributed significantly, helping me with his amazing ideas. Anoushe Jamshidi did a great deal of work on the part of this thesis presented in Chapters V and VI. I want to thank Armin, Gaurav, Ankit, HK, Daya, Andrew, and Shruti for all the discussions that we had and for proofreading my papers. I would like to thank all my fellow labmates in the CCCP research group for their
Architectural Support for Irregular Programs and Performance Monitoring for Heterogeneous Systems
, 2014
"... Architectural support for irregular programs and performance monitoring for heterogeneous systems ..."
Abstract
- Add to MetaCart
(Show Context)
Architectural support for irregular programs and performance monitoring for heterogeneous systems
Orchestrating On-Chip Memory Resources for Throughput-Oriented Compilation
"... A key factor in GPU performance efficiency is the number of active threads that can run simultaneously on each streaming multi-processor. The active threads have their states saved on fast memory devices and can quickly be scheduled to run if the set of running threads stalls due to memory latency. ..."
Abstract
- Add to MetaCart
(Show Context)
A key factor in GPU performance efficiency is the number of active threads that can run simultaneously on each streaming multiprocessor. The active threads have their states saved in fast on-chip memory and can quickly be scheduled to run if the set of running threads stalls due to memory latency. The greater the number of active threads, the higher the utilization that can be obtained from many-core processor pipelines. To achieve optimal utilization, we typically need many more active threads than the number of physical cores. Because on-chip memory resources, including registers and scratch-pad memory, are limited, and because every thread gets an equal partition of these resources, the number of active threads depends on the characteristics of a given program and the back-end compilation efficiency in resource allocation. When a large and complicated program requires more registers per thread, its performance may degrade significantly due to the decrease in the total number of active threads. In this paper, we propose a novel resource allocation approach for back-end compilation of throughput-oriented GPU processors. This approach leverages on-chip scratch-pad memory to reduce register pressure and increase GPU processor occupancy for maximum throughput. The scratch-pad memory serves as a middle layer between registers and long-latency off-chip memory. On one hand, it reduces register usage per thread. On the other hand, it can serve as a caching layer for variables that need to be staged into registers from global memory. We have formulated the resource allocation problem for optimal utilization and throughput of many-core processors and proposed efficient models and techniques. We implemented these techniques in a binary optimizer and evaluated it on a set of realistic benchmarks on real GPUs. We demonstrated the effectiveness of our techniques by achieving up to a 1.65x speedup compared to programs compiled by nvcc with the highest optimization flag.
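A small worked example of the occupancy arithmetic that motivates this approach; the resource figures below are illustrative assumptions, not numbers taken from the paper or any particular GPU.

#include <algorithm>
#include <cstdio>

// Active threads per SM are capped by whichever on-chip resource runs out first.
int active_threads(int regs_per_sm, int smem_per_sm,
                   int regs_per_thread, int smem_per_thread, int hw_max_threads) {
    int by_regs = regs_per_thread > 0 ? regs_per_sm / regs_per_thread : hw_max_threads;
    int by_smem = smem_per_thread > 0 ? smem_per_sm / smem_per_thread : hw_max_threads;
    return std::min({by_regs, by_smem, hw_max_threads});
}

int main() {
    // Illustrative SM budget: 64K registers, 48KB scratch-pad, hardware cap of 2048 threads.
    // Original kernel: 64 registers/thread, no scratch-pad -> register-limited to 1024 threads.
    std::printf("before: %d\n", active_threads(65536, 49152, 64, 0, 2048));
    // After spilling 8 registers (32 bytes) per thread to scratch-pad:
    // 65536/56 = 1170 by registers, 49152/32 = 1536 by scratch-pad -> 1170 active threads.
    std::printf("after:  %d\n", active_threads(65536, 49152, 56, 32, 2048));
    return 0;
}

In this example, moving eight registers per thread into 32 bytes of scratch-pad raises the register-imposed limit from 1024 to 1170 active threads per SM while staying within the scratch-pad budget, which is exactly the trade-off the proposed allocator searches for.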